CUDA: refactor and optimize IQ MMVQ #8215
Conversation
Thank you for tagging me. I'm not quite familiar with IQ_XXX, so I've tagged @OuadiElfarouki @luoyu-intel for awareness.
iq4_xs as well as all iq2 and iq3 models should be affected. Overall the changes I made to SYCL should be very simple; I just can't test whether they actually work.
Shouldn't the removed code allow the IQ quants to work with GPUs without dp4a? I understand that due to the CC check […]
If […]
Right, but the problem is that dmmv does not support IQ quants. Alternatively it can be reported correctly in […]
How about this: replace […]
Performance with a […]
Meta-Llama-3-8B-Instruct-Q6_K.gguf on main (ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no; build: 0ddeff1 (3273))
Meta-Llama-3-8B-Instruct-Q6_K.gguf on branch (ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no; build: 30f85eb (3271)) 🚀
Meta-Llama-3-8B-Instruct-IQ4_NL.gguf main
Meta-Llama-3-8B-Instruct-IQ4_NL.gguf branch (Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes)
Meta-Llama-3-8B-Instruct-Q8_0.gguf main (Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes)
Meta-Llama-3-8B-Instruct-Q8_0.gguf branch (Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes)
🚀 🐎
This PR improved batch performance by almost an order of magnitude at both 2 and 3 streams on my P100 as well:

Meta-Llama-3-8B-Instruct-Q8_0.gguf batch, main
main: n_kv_max = 4096, n_batch = 512, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 14, n_threads_batch = 14
(not sure why b=2 got WORSE here, but I re-ran several times)

Meta-Llama-3-8B-Instruct-Q8_0.gguf batch, branch
main: n_kv_max = 4096, n_batch = 512, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 14, n_threads_batch = 14
b=2 is now double and b=3 is the sweet spot for overall throughput.

Meta-Llama-3-8B-Instruct-IQ4_NL.gguf batch, branch
main: n_kv_max = 4096, n_batch = 512, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 14, n_threads_batch = 14
I am unable to compare with main, same […]. On branch, IQ4 shows the same trend as Q8.
I have a device that CUDA lists as […]
No.
* CUDA: refactor and optimize IQ MMVQ
* uint -> uint32_t
* __dp4a -> ggml_cuda_dp4a
* remove MIN_CC_DP4A checks
* change default
* try CI fix
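For context on the `__dp4a -> ggml_cuda_dp4a` and `remove MIN_CC_DP4A checks` items: the idea is to route all 4x int8 dot products through one wrapper that falls back to plain byte arithmetic when the hardware instruction is unavailable, so individual call sites no longer need their own compute-capability checks. A minimal sketch of such a wrapper (illustrative only; the 6.1 threshold is written out by hand and the actual implementation in llama.cpp's CUDA common header may differ):

```cuda
// Sketch: dp4a with a software fallback so callers need no per-architecture checks.
// CC 6.1 is the first architecture with hardware dp4a.
static __device__ __forceinline__ int ggml_cuda_dp4a(const int a, const int b, int c) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    return __dp4a(a, b, c); // hardware 4x int8 dot product with accumulate
#else
    // Fallback: unpack both 32-bit operands into signed bytes and accumulate manually.
    const int8_t * a8 = (const int8_t *) &a;
    const int8_t * b8 = (const int8_t *) &b;
    return c + a8[0]*b8[0] + a8[1]*b8[1] + a8[2]*b8[2] + a8[3]*b8[3];
#endif
}
```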
Understood, cuda-iq-opt-3 is merged. Is test data from a "Maxwell 2.0" / Compute 5.2 GPU still helpful? llama-simple and llama-benchmark-matmult worked with a Tesla M40. I ran through these before the M40 reached 90 °C:
llama-bench, -batched-bench, -server, and -cli all core dumped. I have detailed files.
Bench and server in ggerganov:master are working great with mixed 5.2 and 6.1
These are all working in master.
More data is always helpful. If it turns out that some changes in this PR were bad they can potentially be reverted.
Just to be clear, do you mean that they work on master prior to or after this PR?
These work in master after this PR was merged.

./llama-batched-bench --version

I'm running more tests against the M40 on its own now. Will gather and share more.

./llama-bench -m llava-v1.6-vicuna-13b.Q4_K_M.gguf -o md -fa 1 -ngl 41 -b 32 -ub 32 -p 32 -t 20
build: f619024 (3291)
I've run llama-bench and llama-batched-bench several times against several models, with only the Tesla M40 available. Verbose output is available.

./llama-batched-bench --version

Is running these same tests against earlier pre-merge builds helpful?
main: n_kv_max = 4096, n_batch = 32, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
I don't need the numbers in isolation. I only would have wanted to know whether there is a performance regression, since I changed one of the defaults. But if it didn't work prior to the PR anyway, there is no point.
These are llama-bench runs built against ggerganov:master tags/b3266 (pre-merge cuda-iq-opt-3 build: 1c5eba6 (3266)) and post-merge build: f619024 (3291)
The -bench run times are much better in the new builds. I don't see huge t/s deltas.

build: 1c5eba6 (3266) - Hathor-L3-8B-v.01-Q5_K_M-imat.gguf
time ./llama-bench -m /mnt/models/gguf/Hathor-L3-8B-v.01-Q5_K_M-imat.gguf -t 20 -fa 1 -ngl 99 -b 512 -ub 512
build: 1c5eba6 (3266)

build: 213701b (3324) - Hathor-L3-8B-v.01-Q5_K_M-imat.gguf
time ./llama-bench -m /mnt/models/gguf/Hathor-L3-8B-v.01-Q5_K_M-imat.gguf -t 20 -fa 1 -ngl 99 -b 512 -ub 512
build: 213701b (3324)

build: 1c5eba6 (3266) - replete-coder-llama3-8b-iq4_nl-imat.gguf

build: 213701b (3324) - replete-coder-llama3-8b-iq4_nl-imat.gguf
time ./llama-bench -m /mnt/models/gguf/replete-coder-llama3-8b-iq4_nl-imat.gguf -t 20 -fa 1 -ngl 99 -b 512 -ub 512
build: 213701b (3324)

build: 1c5eba6 (3266) - llava-v1.6-vicuna-13b.Q4_K_M.gguf
time ./llama-bench -m /mnt/models/gguf/llava-v1.6-vicuna-13b.Q4_K_M.gguf -t 20 -fa 1 -ngl 99 -b 512 -ub 512
build: 1c5eba6 (3266)

build: 213701b (3324) - llava-v1.6-vicuna-13b.Q4_K_M.gguf
time ./llama-bench -m /mnt/models/gguf/llava-v1.6-vicuna-13b.Q4_K_M.gguf -t 20 -fa 1 -ngl 99 -b 512 -ub 512
build: 213701b (3324)
Thanks, those numbers look good.
This PR refactors and optimizes the IQ MMVQ CUDA code. Notably, as part of these changes I'm changing some values in ggml-common.h. The "qr" values are meant to represent how many low-bit data values are contained in a single 8-bit integer. This value is used to derive "qi", which represents how many 32-bit integers are needed to represent the low-bit data values of a quantized block. These values are intended to be properties of the data type, independent of any kernels.

In MMVQ, qr and qi are used to determine how one load of 32-bit integers for the quantized weights needs to be aligned with the loads of the q8 activations. It is oftentimes beneficial to load more values at once, which is intended to be done via the "vdr" value: a factor that increases the number of simultaneous loads so that the total stride per invocation of vec_dot_q_cuda is qr*vdr. However, for the IQ quants this was instead done by increasing QR. This does not matter for MMVQ, but it is a problem for MMQ, where the values of qr and qi matter for determining how much shared memory needs to be allocated and how the activations need to be loaded. So for this reason I'm changing the qr and qi values of the IQ quants to the originally intended values.

Notably this affects the SYCL backend, but I am not able to test the corresponding changes myself due to a lack of Intel hardware. @arthw @airMeng I don't know who to tag in terms of llama.cpp SYCL developers; please either test my changes or tell me who I should contact.
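To make the relationship concrete, here is a small sketch of how these constants fit together, using Q4_0 as an example (the Q4_0 defines below match the existing definitions in ggml-common.h; the comments restate the description above):

```cuda
// qr: number of low-bit values packed into one 8-bit byte (2 for a 4-bit type).
// qi: number of 32-bit integers holding the quantized data of one block of QK
//     values, i.e. QK / (4 * qr).
#define QK4_0 32
#define QR4_0 2
#define QI4_0 (QK4_0 / (4 * QR4_0))   // 32 / 8 = 4 ints of packed 4-bit data per block

// In MMVQ, vdr is a multiplier on how many such loads one call processes, so the
// total stride per invocation of vec_dot_q_cuda is qr*vdr
// (e.g. qr = 2, vdr = 2 gives a stride of 4).
```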
Performance changes