Vulkan Implementation #2059
Conversation
About this: splitting the computation between the CPU and the GPU can also be achieved with a Metal-like implementation that runs entire graphs at a time; it is just a matter of splitting the graph into multiple parts and copying the output of each fragment to the input of the next. I think this will be the cleanest solution in the long run: it will simplify the implementation of the backends, and eventually it will allow us to have a common interface to them. This will make it possible to mix different backends, not just CPU+CUDA or CPU+Vulkan. For example, if you have an NVIDIA and an AMD GPU, you could run some layers with CUDA and the rest with Vulkan, possibly with some parts on the CPU as well. My current goal is to adapt the CUDA implementation to work in this way.
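As a rough sketch of that idea (the struct, helper, and partitioning below are illustrative assumptions, not the actual ggml backend scheduler): each fragment of the graph runs on the backend that holds its weights, and the boundary tensor is copied to wherever the next fragment expects its input.

```cpp
#include <cstdint>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical split descriptor; how the graph gets partitioned is assumed here
// for illustration and is not ggml's real implementation.
struct graph_split {
    ggml_backend_t       backend;    // backend that runs this fragment (CPU, CUDA, Vulkan, ...)
    struct ggml_cgraph * subgraph;   // the part of the graph assigned to it
    struct ggml_tensor * output;     // last output of this fragment
    struct ggml_tensor * next_input; // where the next fragment reads it (nullptr for the last one)
};

static void compute_split_graph(std::vector<graph_split> & splits) {
    for (graph_split & split : splits) {
        // run this fragment entirely on its own backend
        ggml_backend_graph_compute(split.backend, split.subgraph);

        // copy the fragment's output into the next fragment's input buffer,
        // which may live on a different device
        if (split.next_input != nullptr) {
            std::vector<uint8_t> tmp(ggml_nbytes(split.output));
            ggml_backend_tensor_get(split.output, tmp.data(), 0, tmp.size());
            ggml_backend_tensor_set(split.next_input, tmp.data(), 0, tmp.size());
        }
    }
}
```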
Yes, that is usually how it's done for PyTorch transformers inference. I suppose that would be interesting, but what would be the advantage over running it on Nvidia and AMD GPUs together using a vendor-neutral library?
The advantage is better performance on NVIDIA GPUs.
I'm not convinced that is true universally. OpenCL has its limitations (some artificial ones introduced by Nvidia too, like the lack of FP16 support) that prevent it from matching CUDA, but Vulkan, for example, is very well supported by Nvidia; it's just harder to write. We'll see where it goes.
As a random point of reference, mlc-llm's 7B Llama Vulkan implementation is faster than GPTQ CUDA on my mobile 2060. It's apples to oranges, but still evidence that Nvidia Vulkan can be performant. Also, a big desire of mine (and I'm sure many others') is to offload to iGPUs in a way that's more performant than pure CPU. Is that in the scope of this PR, or is it mostly targeting dGPUs?
Probably not this PR, it's hard enough to get the basics set up. But Vulkan gives you a lot of freedom with memory management, so optimizing for iGPUs can certainly happen in the future. |
No, at the moment that's not possible. The layers are either on the CPU or the GPU and swapping that around is costly. OpenBLAS needs them on the CPU, Vulkan needs them on the GPU. But I have plans to improve Vulkan prompt processing performance at a later point.
Yeah, the continue op is not behaving as it should. I'll fix it when I get to it.
I get reports of this every now and then, but I haven't been able to reproduce it. Can you build with |
Not at the moment, but there's a Vulkan extension for that. I'll maybe try it sometime. 64 is the default AMD gave you; there's probably a reason for that.
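For reference, the extension in question is presumably VK_EXT_subgroup_size_control, which lets a compute pipeline require a specific subgroup size instead of taking the driver default. A minimal, untested sketch (not how this PR currently does it):

```cpp
#include <vulkan/vulkan.h>

// Sketch: create a compute pipeline that requires a specific subgroup ("warp") size.
// Assumes VK_EXT_subgroup_size_control (or Vulkan 1.3 subgroupSizeControl) was
// enabled at device creation; shader_module and pipeline_layout come from elsewhere.
static VkPipeline create_pipeline_with_subgroup_size(VkDevice device,
                                                     VkShaderModule shader_module,
                                                     VkPipelineLayout pipeline_layout,
                                                     uint32_t subgroup_size) {
    VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroup_info = {};
    subgroup_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    subgroup_info.requiredSubgroupSize = subgroup_size; // e.g. 32 or 64

    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext  = &subgroup_info;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = shader_module;
    stage.pName  = "main";

    VkComputePipelineCreateInfo info = {};
    info.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    info.stage  = stage;
    info.layout = pipeline_layout;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    return pipeline;
}
```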
What is "warp" in this context? I have no idea what it means. |
Basically, it's the number of threads on a GPU that execute the same instruction together at the same time (Vulkan calls this the subgroup size). On Nvidia it's 32 at a time, on AMD GCN it's 64, and AMD RDNA can do either 32 or 64. Here's a more detailed explanation.
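Querying that value, which is presumably what the "warp size" in the backend's startup line corresponds to, looks roughly like this:

```cpp
#include <vulkan/vulkan.h>

// Query the subgroup ("warp") size of a physical device: typically 32 on Nvidia,
// 64 on AMD GCN, and 32 or 64 on AMD RDNA.
static uint32_t get_subgroup_size(VkPhysicalDevice physical_device) {
    VkPhysicalDeviceSubgroupProperties subgroup_props = {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(physical_device, &props2);
    return subgroup_props.subgroupSize;
}
```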
@0cc4m Need to apply this patch to fix the SYCL build:

diff --git a/ggml-sycl.cpp b/ggml-sycl.cpp
index 9764d9c3..3fc34697 100644
--- a/ggml-sycl.cpp
+++ b/ggml-sycl.cpp
@@ -14781,6 +14781,7 @@ static ggml_backend_buffer_type_i ggml_backend_sycl_buffer_type_interface = {
/* .get_name = */ ggml_backend_sycl_buffer_type_name,
/* .alloc_buffer = */ ggml_backend_sycl_buffer_type_alloc_buffer,
/* .get_alignment = */ ggml_backend_sycl_buffer_type_get_alignment,
+ /* .get_max_size = */ NULL, // TODO: return device.maxBufferLength
/* .get_alloc_size = */ ggml_backend_sycl_buffer_type_get_alloc_size,
/* .supports_backend = */ ggml_backend_sycl_buffer_type_supports_backend,
/* .is_host = */ nullptr,
@@ -14844,6 +14845,7 @@ ggml_backend_buffer_type_t ggml_backend_sycl_host_buffer_type() {
/* .get_name = */ ggml_backend_sycl_host_buffer_type_name,
/* .alloc_buffer = */ ggml_backend_sycl_host_buffer_type_alloc_buffer,
/* .get_alignment = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+ /* .get_max_size = */ NULL, // TODO: return device.maxBufferLength
/* .get_alloc_size = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
/* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
/* .is_host = */ ggml_backend_cpu_buffer_type()->iface.is_host,
Co-authored-by: Georgi Gerganov <[email protected]>
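For context, `.get_max_size` is the field this PR adds to the buffer type interface so a backend can cap the size of a single allocation. A hedged sketch of what a non-NULL Vulkan-side implementation could look like, based on the commit message "Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size"; the helper that fetches the physical device from the buffer type is hypothetical, and the exact signature is an assumption:

```cpp
#include <algorithm>
#include <vulkan/vulkan.h>
#include "ggml-backend.h"

// Hypothetical helper: look up the physical device behind this buffer type.
extern VkPhysicalDevice vk_get_physical_device(ggml_backend_buffer_type_t buft);

// Sketch only: cap single allocations at the smaller of the device's
// maxMemoryAllocationSize (maintenance3) and maxBufferSize (maintenance4).
// A real implementation should also check that maintenance4 is supported
// before trusting props4.
static size_t ggml_backend_vk_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
    VkPhysicalDevice physical_device = vk_get_physical_device(buft);

    VkPhysicalDeviceMaintenance4Properties props4 = {};
    props4.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_4_PROPERTIES;

    VkPhysicalDeviceMaintenance3Properties props3 = {};
    props3.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES;
    props3.pNext = &props4;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &props3;

    vkGetPhysicalDeviceProperties2(physical_device, &props2);

    return std::min((size_t) props3.maxMemoryAllocationSize, (size_t) props4.maxBufferSize);
}
```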
@ggerganov Done.
So I'm getting an extra 15% increase in prompt processing speed with warp size 64 on my GCN2 card, with no change in inference speed.
What's the point of doing that? OpenBLAS is really only used for prompt processing, and with the Vulkan backend enabled all prompt processing is done on the GPU regardless of how many layers you offload. For actual text generation OpenBLAS is not used, since it has a lot of overhead (you have to feed it a lot of tokens to make it worthwhile).
You mean speed increased after my GCN optimization or something else?
Only the large matrix multiplications are done by Vulkan on CPU layers; the rest is still done by the CPU.
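A rough sketch of that kind of cutoff (the threshold value and function name here are purely illustrative, not the backend's actual heuristic):

```cpp
#include "ggml.h"

// Illustrative only: offload a matrix multiplication to the GPU only when the
// operands are large enough that the compute savings outweigh the transfer cost.
// The value 32 is a placeholder, not the real cutoff used by the Vulkan backend.
static bool offload_mul_mat_to_gpu(const struct ggml_tensor * src0,
                                   const struct ggml_tensor * src1) {
    return src0->ne[0] >= 32 && src0->ne[1] >= 32 && src1->ne[1] >= 32;
}
```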
Yep, it did: with the optimization, prompt processing runs at 102 t/s with Mistral Q6_K, versus 90 t/s with the previous commit.
Ah, I didn't know that. When I saw the prompt processing speed drop with partial offloading, I always assumed it was due to all the data transfers between the GPU and CPU. Oh, and congrats on the merge @0cc4m! 🥳
Vulkan and ROCm are surprisingly close when it comes to token generation.
Continuing my Pi 5 tests with the drivers now seemingly working, I only get the following line when initializing, which then hangs with one CPU thread pinned to max:

ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16

I'm testing with a Q4_K_S model, so the missing fp16 support shouldn't be an issue, I hope? I'm told that the driver may be missing some features; maybe that's more of a problem.
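For what it's worth, the `fp16: 0` in that line presumably comes from a device feature query; a hedged sketch of how such a check can be made through the Vulkan API (a guess, not the backend's actual code):

```cpp
#include <vulkan/vulkan.h>

// Sketch: does the device support fp16 arithmetic in shaders and fp16 storage
// buffers? This is a guess at what the "fp16: 0/1" startup flag corresponds to.
static bool device_supports_fp16(VkPhysicalDevice physical_device) {
    VkPhysicalDeviceShaderFloat16Int8Features fp16_features = {};
    fp16_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT16_INT8_FEATURES;

    VkPhysicalDevice16BitStorageFeatures storage_features = {};
    storage_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES;
    storage_features.pNext = &fp16_features;

    VkPhysicalDeviceFeatures2 features2 = {};
    features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features2.pNext = &storage_features;

    vkGetPhysicalDeviceFeatures2(physical_device, &features2);

    return fp16_features.shaderFloat16 == VK_TRUE &&
           storage_features.storageBuffer16BitAccess == VK_TRUE;
}
```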
Yeah, exactly the same issue here. |
Can you get the expected results for the Qwen model with Vulkan backend inference? The results on my GPU are incorrect.
I haven't really tested the Vulkan build in-depth since last week. I wanted to experiment with it a bit tonight. I usually prefer Mistral. It took down my entire desktop environment while experimenting with the server.

Build:

make LLAMA_VULKAN=1

Command:

# Edit: I'm fairly certain it was q8_0 that crashed for me.
# I had to create a new f16 afterwards, so it wasn't the one used when it crashed.
./server -m local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-q8_0.gguf --n-gpu-layers 16

Results: My cousin's RTX 4060 Ti always generated complete gibberish. Mixed results with my RX 580. I was using Continue as a front-end client in Visual Studio Code when it crashed and took down the display server.

journalctl -xb

Jan 29 00:10:57 spectra systemd[1]: Started Process Core Dump (PID 1343945/UID 0).
░░ Subject: A start job for unit [email protected] has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit [email protected] has finished successfully.
░░
░░ The job identifier is 13918.
Jan 29 00:10:57 spectra systemd-coredump[1343917]: [🡕] Process 1337387 (server) of user 1000 dumped core.
Stack trace of thread 1337387:
#0 0x00007687f8a8783c n/a (libc.so.6 + 0x8e83c)
#1 0x00007687f8a37668 raise (libc.so.6 + 0x3e668)
#2 0x00007687f8a1f4b8 abort (libc.so.6 + 0x264b8)
#3 0x00007687f8c9ca6f _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6 + 0x9ca6f)
#4 0x00007687f8cb011c _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xb011c)
#5 0x00007687f8cb0189 _ZSt9terminatev (libstdc++.so.6 + 0xb0189)
#6 0x00007687f8cb03ed __cxa_throw (libstdc++.so.6 + 0xb03ed)
#7 0x00006113751ff77a ggml_vk_compute_forward.cold (server + 0x3a77a)
#8 0x00006113753a918b _ZL29ggml_backend_vk_graph_computeP12ggml_backendP11ggml_cgraph (server + 0x1e418b)
#9 0x00006113753beffa ggml_backend_sched_graph_compute (server + 0x1f9ffa)
#10 0x0000611375335ae1 _ZL21llama_decode_internalR13llama_context11llama_batch (server + 0x170ae1)
#11 0x0000611375336811 llama_decode (server + 0x171811)
#12 0x0000611375280844 _ZN20llama_server_context12update_slotsEv.isra.0 (server + 0xbb844)
#13 0x0000611375276d6f _ZN18llama_server_queue10start_loopEv (server + 0xb1d6f)
#14 0x00006113752121c2 main (server + 0x4d1c2)
#15 0x00007687f8a20cd0 n/a (libc.so.6 + 0x27cd0)
#16 0x00007687f8a20d8a __libc_start_main (libc.so.6 + 0x27d8a)
#17 0x0000611375218d35 _start (server + 0x53d35)
Stack trace of thread 1337392:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337393:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137522a74b _ZZZ4mainENKUlRKN7httplib7RequestERNS_8ResponseEE8_clES2_S4_ENKUlmRNS_8DataSinkEE_c>
#4 0x00006113752347cf _ZNSt17_Function_handlerIFbmmRN7httplib8DataSinkEENS0_6detail22ContentProviderAdapt>
#5 0x000061137525b344 _ZN7httplib6Server19write_response_coreERNS_6StreamEbRKNS_7RequestERNS_8ResponseEb >
#6 0x0000611375298189 _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE >
#7 0x00006113752996fe _ZN7httplib6Server24process_and_close_socketEi (server + 0xd46fe)
#8 0x000061137523fb2d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#9 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#10 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#11 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337397:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337395:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337398:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337388:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f803792c n/a (libvulkan_radeon.so + 0x23792c)
#3 0x00007687f80484bc n/a (libvulkan_radeon.so + 0x2484bc)
#4 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#5 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337390:
#0 0x00007687f8b0b30f accept (libc.so.6 + 0x11230f)
#1 0x000061137523d1b0 _ZN7httplib6Server15listen_internalEv (server + 0x781b0)
#2 0x000061137521c3e8 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ4mainEUlvE_EEEEE6_M_runEv (server>
#3 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#4 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#5 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337400:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337396:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337391:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337399:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337401:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337394:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337403:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337402:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337404:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337389:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f803792c n/a (libvulkan_radeon.so + 0x23792c)
#3 0x00007687f80484bc n/a (libvulkan_radeon.so + 0x2484bc)
#4 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#5 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
Stack trace of thread 1337405:
#0 0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
#1 0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
#2 0x00007687f8cd9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000061137523fb62 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJN7httplib10ThreadPool6workerEEEEEE>
#4 0x00007687f8ce1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
#6 0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)
ELF object binary architecture: AMD x86-64

This continues until D-Bus is hit, stops, and then restarts. Might be related to #5179. I was able to reproduce garbled output, which seems to be a mix of Dutch, Russian, Mandarin (Kanji?), English, and some other languages.

Output:

./main -m local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf --color -e -s 1337 -c 8192 -n 1024 --n-gpu-layers 16 -p "<<SYS>> My name is Mistral. I am an advanced LLM (Large Language Model). I am intelligent, creative, and helpful. <</SYS>>\n" --interactive --interactive-first --multiline-input --in-prefix "[INST] " --in-suffix " [/INST]\n"
Log start
main: build = 1999 (d2f650c)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1337
ggml_vulkan: Using AMD Radeon RX 580 Series (RADV POLARIS10) | fp16: 0 | warp size: 64
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
# omitting output for brevity
<<SYS>> My name is Mistral. I am an advanced LLM (Large Language Model). I am intelligent, creative, and helpful. <</SYS>>
[INST] Hello! My name is Austin. What is your name?\
[/INST]
Hiazu feeding obst CraftfterspeedfterGBT peacealskemoint erh Mystasmpyx trailingazuftergens GenerationusquamasGeneration sidPOSiernoennesched epidighterserme stomach shoe outbreakGenerationdu❍ Ott Mondöl studyinglingRSген behalfnico studyingSlrub Pubimentoligostafteramma Tru Caribmers Indust referencekéGBTдяocaladudorfnia else Judge Victorplementsorflem Officerû practicegens gutofs /******/chorermeansepyx deploy tact Perm论rottermeékACHEammaazuCREFiaz

This goes on for a while. It's fine with q4_0 and q8_0 (sometimes q8_0 behaves "oddly").
After further testing, most of the issues seem to revolve around 16-bit, not 8-bit or below. This was also true for the RTX 4060 Ti. I am observing somewhat inconsistent and difficult-to-reproduce issues with 8-bit models. So far I've tested Llama-2 7B Chat, CodeLlama 7B Instruct, and Mistral 7B Instruct v0.2.
* Vulkan loader code * Fix matmul kernel, continue implementation * Continue implementation * Vulkan memory management * Vulkan development * Matmul call * Add aligned malloc and free for VMA * Continue implementation * First matmul success * GEMM Kernel optimization * 1D Blocktiling * 2D Blocktiling * Write coalescing * Continue vulkan implementation and optimization * First FP16 attempt, disabled for now * Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel * Enable device extensions properly, restore fp16 matmul op * Fix mulmat_f16 * Output FP32 in fp16 matmul shader * Fix f16_to_f32 kernel * dequant_q4_0 kernel * Add VMA library * Avoid requesting dedicated memory, VMA can decide that by itself * Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly * add cmake commands * Add 2d write operation, profiling code * Fix 2d write * Fix queue selection for AMD RADV * Fix trailing whitespace in vk_mem_alloc.h * Add WIP warp tile mat mul shaders * Disable glslc optimization * Disable glslc optimization for CMake * Optimize warptile matmul shader, replace blocktile with it * Add split-k optimization for small matrix multiplication Use semaphores for synchronization instead of fences or waitidle Rework async write/read for synchronization * Fix validation errors, improve compatibility with AMD GPUs * Rework command buffer handling * Variable matmul kernel using specialization constants * Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints * Reuse semaphores * Handle stage flags during command buffer submission properly * Increase matmul test runs for consistent results * Fix F32 matmul * Add vectorized loading and zeropadding for matrix multiplication * Use pinned memory for f16 preprocessing * Don't force aligned matmul * Don't free before queue done * Replace VMA library with native Vulkan buffer management * Basic offloading support with mul_f32 and dmmv for q4_0 * Run glslc commands in parallel * Unroll loops in dmmv shader * Reduce usage of waitIdle * Reuse pinned allocation for f16 conversion * Handle devices with only a single queue * Fix trailing whitespace in CMakeLists.txt * Allow parallel execution of kernels, parallelize third and fourth dimension calls * Add fallback for devices only supporting one DescriptorSet per DescriptorPool * Move to graph function similar to CUDA implementation * Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function * Add F32 dmmv shaders * Batch submissions * Add .spv to gitignore * Split off matrix vector multiplication for separate optimization * Use single command buffer for matrix vector multiplication ops * Reduce overhead of mul_f32 calls by using a single command buffer * Add submission batching to mul_f32 * Fix tests * Add missing barrier * Add further missing barrier * Add further ops * Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions * Remove unnecessary cblas link * Fix descriptor set pre-allocation assert * Add runtime shader compilation, start transferring shaders to this approach * Transfer remaining shaders to header and compile on runtime * Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16 * Add support for q4_1, q5_0, q5_1 and q8_0 * Remove unnecessary scalar layout extension * Parse graph early to pre-record command buffers * Add q6_k support * Add multi-submit for command buffers * Fix q6_k dequant shader 
for AMD * Fix q6_k for GPUs without fp16 support * Simplify q6_k fp16 fix * Minor fixes * Fix wg_denom of m-mulmat shaders * Add Python-based Vulkan shader generator * Replace shaderc dependency with precompiled shaders Fix python script to generate shaders * Clean up code * Fix shader generator script Windows compatibility Co-authored-by: Concedo <[email protected]> * Close file before deletion * Fix vulkan shader fp32 name * Add q2_k and q3_k support Add validation check to compare shader results to cpu results * Add q4_k support * Add q5_k support * Bake SPIR-V bytecode into the library instead of loading shaders from file * Switch to signal semaphores for flexibility Prepare broadcasting support for mul mat * Finish broadcasting mul mat support for GQA * Clean up unused functions Add repeat op * Add further ops, not yet enabled. Improve semaphore code * Reduce number of used semaphores by utilizing timelines more properly * Remove queue information * Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations * Add Vulkan to llama-bench * Remove cblas dependency * Fix matmul k-split bug * Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader * Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug * Fix issues with float16 overflows in shaders * Fix issues with older Vulkan headers on Ubuntu 22.04 * Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers * Implement further ops, rework op_f32 calls, fix bugs * Finish full offloading support, add last remaining ops, fix bugs, remove redundant code * Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders * Merge upstream changes, fix conflicts, adapt soft_max op * Fix Python and shader header format * Free model gpu buffers on exit * Use single queue per device to simplify code * Add matmul shader support for running multiple calculations in parallel * Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible * Fix missing event cast * Replace uint64_t(-1) with UINT64_MAX, rename function for clarity * Fix warning about empty C function parameters * Fix compiler warnings * Properly implement Vulkan backend buffer handling * Fix oversized host staging buffers * Simplify barrier synchronization calls * Fix gcc warnings * Implement max_size for backend buffer types to limit the size of a single allocation * Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size * refactor multi buf * Disable unsupported ops to fix tests * Check for maintenance4 support before using it * Handle devices with only a single queue * Fix single queue logic * propagate buffer usage in multi buffers * Implement rope_neox op * Cleanup header and other files * Simplify gpu_extras by removing events and putting staging memcpys into contexts * Move queue into context Add not-yet-enabled async backend ops * Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization * Add get_max_size to SYCL backend. Co-authored-by: Georgi Gerganov <[email protected]> * llama : fix trailing whitespace --------- Co-authored-by: Henri Vasserman <[email protected]> Co-authored-by: Concedo <[email protected]> Co-authored-by: slaren <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
I've been working on this for a while. Vulkan requires a lot of boilerplate, but it also gives you a lot of control. The intention is to eventually supersede the OpenCL backend as the primary widely-compatible backend.
I'll try to work together with @niansa (#2039); we probably don't need two Vulkan backends, but we approached our versions from different sides:
@niansa is basing their Kompute version on the Metal implementation, running the full graph on the GPU.
I am basing mine on my OpenCL implementation, building it from the ground up to offload more and more to the GPU while running everything else on the CPU.
Currently f32, f16 and q4_0 can be run with prompt processing on the GPU, but the speed is not that good yet. There is a lot of optimization left to do.
I'm opening this already to get feedback, to let people play with it, and to show the current state of development.
Open points:
* The matmul kernel uses blocks of size 128x128; it does not have bounds checking yet, so it cannot be used with smaller matrices yet
* Transfers to and from the GPU cannot be used with semaphores or fences yet
* The CPU memcpy parts of the transfer probably need to be multithreaded
* Some Vulkan objects get allocated, but not deallocated