
k-quants #1684

Merged: 32 commits, Jun 5, 2023

Conversation

ikawrakow
Contributor

@ikawrakow ikawrakow commented Jun 3, 2023

What

This PR adds a series of 2-6 bit quantization methods, along with quantization mixes, as proposed in #1240 and #1256. Scalar, AVX2, ARM_NEON, and CUDA implementations are provided.

Why

This is best explained with the following graph, which shows perplexity on the wikitext dataset as a function of model size:
[Figure: ppl_vs_size (perplexity vs. model size)]

Note that the x-axis (model size in GiB) is logarithmic. The circles on the graph show the perplexity of the different quantization mixes added by this PR (see details below). The colors indicate the LLaMA variant used (7B in black, 13B in red, 30B in blue, 65B in magenta). The solid squares in the corresponding color represent (model size, perplexity) for the original fp16 model. The dashed lines are added for convenience, to make it easier to judge how closely the quantized models approach the fp16 perplexity.

As the graph shows, generation performance as measured by perplexity is a fairly smooth function of quantized model size, so the quantization types added by this PR let the user pick the best-performing quantized model within the limits of their compute resources (being able to fully load the model into memory, but also inference speed, which tends to depend on model size). As a specific example, the 2-bit quantization of the 30B model fits on the 16 GB RTX 4080 GPU that I have available, while the other mixes do not, resulting in a large difference in inference performance.

Perhaps worth noting is that the perplexity of the 6-bit quantization is within 0.1% of the original fp16 model.

Another interesting observation is that the relative quantization error (as measured by perplexity) does not keep decreasing as the number of weights in the base model grows, as one might hypothesize from the lower quantization error observed at 13B compared to 7B (see, e.g., this table on the main page). The 13B model is indeed somewhat more amenable to quantization, but the relative quantization error goes back to the 7B level for the 30B and 65B models. This is illustrated by the following graph, an alternative view of the same data that shows the relative difference to the fp16 model in percent. Note that here the x-axis, being the ratio of the quantized size to the fp16 model size, is linear, while the y-axis (percent error) is logarithmic.

[Figure: ppl_vs_size_relative (relative perplexity error vs. relative model size)]

How (Details)

In the existing ggml quantization types we have "type-0" (Q4_0, Q5_0) and "type-1" (Q4_1, Q5_1). In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale. In "type-1", weights are given by w = d * q + m, where m is the block minimum. I use this to describe the quantizations being added by this PR.
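
To make the two conventions concrete, here is a minimal illustration in C. It is not the actual ggml kernel code, which operates on whole packed blocks rather than single values:

```c
#include <stdint.h>

// "type-0": w = d * q        (Q4_0, Q5_0, and the new Q3_K / Q6_K)
// "type-1": w = d * q + m    (Q4_1, Q5_1, and the new Q2_K / Q4_K / Q5_K)
// Illustrative only; block sizes and bit packing differ per type.
static inline float dequant_type0(float d, int8_t q)           { return d * (float) q; }
static inline float dequant_type1(float d, float m, uint8_t q) { return d * (float) q + m; }
```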

The following new quantization types are added to ggml:

  • GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
  • GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
  • GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw (see the struct sketch after this list).
  • GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
  • GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
  • GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
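
To make the super-block layout concrete, here is a sketch of what a "type-1" super-block such as GGML_TYPE_Q4_K can look like in C. The field names are illustrative and may not match k_quants.h exactly, but the sizes are consistent with the 4.5 bpw figure quoted above. The other bpw numbers follow from the same accounting; for example, Q6_K stores 256 six-bit quants, 16 eight-bit block scales, and one fp16 super-block scale, i.e. (256 × 6 + 16 × 8 + 16) / 256 = 6.5625 bpw.

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // fp16 value stored as raw bits, as in ggml

#define QK_K 256                // weights per super-block

// Sketch of a "type-1" 4-bit super-block (w = d * q + m):
//   2 + 2 + 12 + 128 = 144 bytes for 256 weights  ->  144 * 8 / 256 = 4.5 bpw
typedef struct {
    ggml_fp16_t d;              // super-block scale used to dequantize the block scales
    ggml_fp16_t dmin;           // super-block scale used to dequantize the block mins
    uint8_t scales[12];         // 8 blocks x (6-bit scale + 6-bit min), packed
    uint8_t qs[QK_K / 2];       // 256 quants at 4 bits each, two per byte
} block_q4_K_sketch;
```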

This is exposed via llama.cpp quantization types that define various "quantization mixes" as follows:

  • LLAMA_FTYPE_MOSTLY_Q2_K - uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
  • LLAMA_FTYPE_MOSTLY_Q3_K_S - uses GGML_TYPE_Q3_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q3_K_M - uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q3_K_L - uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
  • LLAMA_FTYPE_MOSTLY_Q4_K_S - uses GGML_TYPE_Q4_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q4_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
  • LLAMA_FTYPE_MOSTLY_Q5_K_S - uses GGML_TYPE_Q5_K for all tensors
  • LLAMA_FTYPE_MOSTLY_Q5_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
  • LLAMA_FTYPE_MOSTLY_Q6_K - uses 6-bit quantization (GGML_TYPE_Q6_K) for all tensors

Not mentioned explicitly above is the fact that with this PR, all quantization variants use 6-bit quantization for the output.weight tensor. This lowers the perplexity of, e.g., Q4_0 by about 0.03 at 7B.
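
To illustrate how such a mix might be assembled, here is a simplified sketch of choosing a ggml type by tensor name for a hypothetical Q4_K_M-style mix. This is not the exact selection logic in llama.cpp (the function name and the rule for which "half" of the layers gets 6 bits are assumptions for the sake of the example), but it shows the idea, including the 6-bit output.weight override mentioned above.

```c
#include <string.h>

// Hypothetical enum mirroring the new ggml k-quant types.
enum kq_type { KQ_Q2_K, KQ_Q3_K, KQ_Q4_K, KQ_Q5_K, KQ_Q6_K };

// Pick the quantization type for one tensor of a Q4_K_M-style mix (sketch only).
static enum kq_type choose_q4_k_m(const char *name, int layer, int n_layers) {
    // All k-quant mixes in this PR keep the output tensor at 6 bits.
    if (strcmp(name, "output.weight") == 0) return KQ_Q6_K;
    // Use 6 bits for (roughly) half of attention.wv and feed_forward.w2 ...
    if (strstr(name, "attention.wv") || strstr(name, "feed_forward.w2")) {
        if (layer < n_layers / 2) return KQ_Q6_K;   // which half is an assumption here
    }
    // ... and 4 bits for everything else.
    return KQ_Q4_K;
}
```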

The code is quite lengthy, so it is added in separate files, k_quants.h and k_quants.c, instead of going into ggml.c. I think it would be better to also factor out all other quantization types from ggml.c, but that is up to @ggerganov to decide.

Performance

The following table summarizes the performance results (perplexity, model size, run time for single token prediction). It is basically modeled after the corresponding table on the main page.

| Model | Measure | F16 | Q2_K | Q3_K_S | Q3_K_M | Q3_K_L | Q4_K_S | Q4_K_M | Q5_K_S | Q5_K_M | Q6_K |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.7764 | 6.4571 | 6.1503 | 6.0869 | 6.0215 | 5.9601 | 5.9419 | 5.9208 | 5.9110 |
| 7B | file size | 13.0G | 2.67G | 2.75G | 3.06G | 3.35G | 3.56G | 3.80G | 4.33G | 4.45G | 5.15G |
| 7B | ms/tok @ 4 threads, M2 Max | 116 | 56 | 81 | 69 | 76 | 50 | 55 | 70 | 71 | 75 |
| 7B | ms/tok @ 8 threads, M2 Max | 111 | 36 | 46 | 36 | 46 | 36 | 40 | 44 | 46 | 51 |
| 7B | ms/tok @ 4 threads, RTX 4080 | 60 | 15.5 | 18.6 | 17.0 | 17.7 | 15.5 | 16.0 | 16.7 | 16.9 | 18.3 |
| 7B | ms/tok @ 4 threads, Ryzen 7950X | 214 | 57 | 58 | 61 | 67 | 68 | 71 | 81 | 82 | 93 |
| 13B | perplexity | 5.2543 | 5.8545 | 5.6033 | 5.4498 | 5.4063 | 5.3404 | 5.3002 | 5.2785 | 5.2638 | 5.2568 |
| 13B | file size | 25.0G | 5.13G | 5.27G | 5.88G | 6.45G | 6.80G | 7.32G | 8.36G | 8.60G | 9.95G |
| 13B | ms/tok @ 4 threads, M2 Max | 216 | 103 | 156 | 148 | 144 | 95 | 102 | 132 | 134 | 142 |
| 13B | ms/tok @ 8 threads, M2 Max | 213 | 67 | 83 | 77 | 83 | 68 | 73 | 81 | 84 | 95 |
| 13B | ms/tok @ 4 threads, RTX 4080 | - | 25.3 | 29.2 | 29.3 | 25.5 | 26.2 | 26.2 | 28.6 | 28.9 | 30.0 |
| 13B | ms/tok @ 4 threads, Ryzen 7950X | 414 | 109 | 113 | 118 | 129 | 130 | 137 | 156 | 161 | 180 |

I realize the above table is not easy to read, so here is a shortened version showing a subset of the data:

| Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K |
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 |
| 7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G |
| 7B | ms/tok @ 4 threads, M2 Max | 116 | 56 | 69 | 50 | 70 | 75 |
| 7B | ms/tok @ 8 threads, M2 Max | 111 | 36 | 36 | 36 | 44 | 51 |
| 7B | ms/tok @ 4 threads, RTX 4080 | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 |
| 7B | ms/tok @ 4 threads, Ryzen 7950X | 214 | 57 | 61 | 68 | 81 | 93 |
| 13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 |
| 13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G |
| 13B | ms/tok @ 4 threads, M2 Max | 216 | 103 | 148 | 95 | 132 | 142 |
| 13B | ms/tok @ 8 threads, M2 Max | 213 | 67 | 77 | 68 | 81 | 95 |
| 13B | ms/tok @ 4 threads, RTX 4080 | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 |
| 13B | ms/tok @ 4 threads, Ryzen 7950X | 414 | 109 | 118 | 130 | 156 | 180 |

Notes from the individual commits:

  • I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations.
  • CUDA is not ideal: ~50% slower than Q4_0 for single-token prediction, about the same in batch mode (perplexity). CPU single-token prediction is ~55 ms (on a Ryzen 7950X).
  • It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
  • Single-token prediction is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0.
  • Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single-token prediction is ~10% better than Q4_0; batch mode (perplexity) is about the same.
  • Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit one. On the GPU, single-token prediction is ~6% lower than Q4_0; batch mode (perplexity) is even closer (but still slower).
  • Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit one. On the GPU, performance is about the same as Q4_0 for both single-token and batch prediction.
  • It is 22% slower than Q4_K, despite the smaller model size. On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K.
  • Token prediction is pretty good: about 15.5 ms on an RTX 4080. Perplexity is about the same as Q4_K.
  • About the same performance as Q4_K. Single-token prediction is now ~36 ms on M2 Max. The code is much simpler too.
  • Strangely enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. I only realized something was not quite right when I tried the larger models and started getting nonsense back.
  • In any case, Q2_K single-token evaluation times on an RTX 4080 in a Ryzen 7950X box, using CUDA with the model fully loaded on the GPU, are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The maximum number of layers that fit in VRAM for the 65B model is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token).
  • Q3_K is now running at ~18.5 ms/token on CUDA, so the gap to Q4_0 is only 10%. It seems the memory access pattern is more important for performance than the amount of computation the kernel does.
  • For perplexity, where we are less memory bound, time per pass drops by ~5%. There is a barely measurable difference for single-token prediction.

@ikawrakow ikawrakow requested a review from ggerganov June 3, 2023 15:24
We cannot possibly expect rmse < 0.002 for 2- and 3-bit quantization variants.

@RonanKMcGovern

Separate Q - comparing ggml k_m to GPTQ:

  • I suppose GPTQ with large 128 groups is simpler than ggml and that’s why it runs faster?
  • GPTQ also actively corrects for quantization error (using calibration data and second-order/Hessian information from the layer-wise loss; the objective is sketched below). So perplexity is maybe better, although it is hard to tell how well that generalizes outside the quantisation dataset?
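
For reference, the correction being described is GPTQ's layer-wise objective (from the GPTQ paper): choose quantized weights that minimize the reconstruction error on a small calibration set X, using the Hessian of this quadratic objective to update the not-yet-quantized weights.

$$
\widehat{W} = \arg\min_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2,
\qquad H = 2\,X X^\top
$$

How well this generalizes beyond the calibration data is exactly the open question raised above.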

@JohannesGaessler
Collaborator

JohannesGaessler commented Aug 7, 2023

I suppose GPTQ with large 128 groups is simpler than ggml and that’s why it runs faster?

The last time I checked (on Linux), llama.cpp is faster than AutoGPTQ and about the same speed as ExLlama for token generation. On Windows, llama.cpp performance is currently worse by a factor of ~2, but that is caused by the much higher kernel launch overhead on Windows. Prompt processing using llama.cpp is currently noticeably slower than ExLlama (on all OSs), but this is not due to the quantization format, because currently by default both llama.cpp and ExLlama dequantize the entire weight matrix only once per matrix-matrix multiplication.

More generally, I've tried rearranging the quantized data to spatially separate the quantized values and the scales (which I think is how the data is stored in GPTQ) but this did not improve performance. I interpreted this as cache locality being more important than memory alignment (or I just did it wrong).
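
To illustrate the two layouts being compared (this is not the actual ggml code, and the struct names are made up for the example): in the interleaved layout each block carries its scale next to its quants, so one cache-line fetch brings both; in the separated layout, scales and quants live in different arrays, which improves alignment but touches two distant memory regions per block.

```c
#include <stdint.h>

#define QK 32   // block size used for this illustration

// Interleaved (roughly what ggml does): scale stored next to its quants.
typedef struct {
    float   d;            // block scale
    uint8_t qs[QK / 2];   // 32 quants at 4 bits each
} block_interleaved;

// Separated (the rearrangement described above): one array of scales,
// one array of packed quants for the whole tensor.
typedef struct {
    float   *d;           // n_blocks scales
    uint8_t *qs;          // n_blocks * QK/2 packed quants
} tensor_separated;
```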

@RonanKMcGovern

The last time I checked (on Linux) llama.cpp is faster than AutoGPTQ and about the same speed as ExLlama for token generation.

Yeah, I was running inference using transformers in Colab on a T4. That may have caused specific issues, since it doesn't use llama.cpp.

@Dampfinchen

On Windows llama.cpp performance is currently worse by a factor of ~2 but that is caused by the much higher kernel launch overhead on Windows.

That's a bit hard to believe. Do you have data to back this up? If so, is there any way to circumvent these slowdowns? Most people are running Windows after all.

@JohannesGaessler
Collaborator

JohannesGaessler commented Aug 11, 2023

Do you have data to back this up?

I'll get to expanding the GPU section of my blog soon and I'll put numbers there.

If so, is there any way to circumvent these slowdowns? Most people are running Windows after all.

You can go to Settings -> System -> Graphics Settings and enable hardware-accelerated GPU scheduling, which somewhat helps, but honestly the best solution is to just use Linux.

@JohannesGaessler
Collaborator

Also the CUDA rework by slaren will apparently reduce kernel launch overhead: #2230 (comment)

@RonanKMcGovern

RonanKMcGovern commented Aug 12, 2023

  • I suppose GPTQ with large 128 groups is simpler than ggml and that’s why it runs faster?

Just to address this:

  • The GPTQ speed-up comes from special quantized-matrix × full-precision-vector kernels that reduce the bandwidth/fetching required compared to keeping the matrices in full precision. Where memory bandwidth is the rate-limiting factor, this allows for significant speed-ups. This is my understanding from the paper (a rough sketch of the idea follows below).

I don't see why ggml couldn't take this approach too: #2585 (comment)
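
To illustrate the idea (this is neither GPTQ's nor ggml's actual kernel, and the 32-weight block layout is an assumption for the example): the weights stay 4-bit in memory and are dequantized on the fly while accumulating against the full-precision vector, so the weight traffic is roughly a quarter of the fp32 case.

```c
#include <stdint.h>
#include <stddef.h>

// One row of a 4-bit "type-1" matrix times an fp32 vector.
// ncols is assumed to be a multiple of 32 (the block size in this sketch).
static float q4_row_dot(const uint8_t *qs, const float *scales, const float *mins,
                        const float *x, size_t ncols) {
    float sum = 0.0f;
    for (size_t i = 0; i < ncols; i += 32) {
        const float d = scales[i / 32];
        const float m = mins[i / 32];
        for (size_t j = 0; j < 32; j += 2) {
            const uint8_t b = qs[(i + j) / 2];
            sum += (d * (float)(b & 0x0F) + m) * x[i + j];       // low nibble
            sum += (d * (float)(b >> 4)   + m) * x[i + j + 1];   // high nibble
        }
    }
    return sum;
}
```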

@jrudolph
Contributor

More generally, I've tried rearranging the quantized data to spatially separate the quantized values and the scales (which I think is how the data is stored in GPTQ) but this did not improve performance. I interpreted this as cache locality being more important than memory alignment (or I just did it wrong).

I also did some testing of a similar setup in llama2.scala, to figure out whether being able to use _mm256_load_si256 instead of _mm256_loadu_si256 could have any impact, but I didn't find much (on a Ryzen 4800H, which should be quite similar to your 3700X). The problem is that it is easy to get things slightly wrong and hard to debug. My thinking was that it surely cannot be beneficial to load data into SIMD vectors across a cache-line split, but maybe the CPU back end has enough resources these days for it not to matter (or the load latency is already masked, so a small latency hit does not add to the critical path)? Also, even if a cache line is split, we will still be able to use the data in the next iteration directly from cache?
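
For reference, the difference being tested above comes down to which load intrinsic is used; a minimal sketch (assuming AVX2, and 32-byte-aligned data for the aligned variant):

```c
#include <immintrin.h>

// Unaligned load: works for any address, but may straddle a cache-line boundary.
static inline __m256i load_u(const void *p) { return _mm256_loadu_si256((const __m256i *) p); }

// Aligned load: requires p to be 32-byte aligned, otherwise it faults.
static inline __m256i load_a(const void *p) { return _mm256_load_si256((const __m256i *) p); }
```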

@yiliu30

yiliu30 commented Oct 9, 2023

@ikawrakow Hi, I have a question about the naming convention. Do _S, _M, and _L stand for Small, Medium, and Large (referring to the model size)?

@mirek190

mirek190 commented Oct 9, 2023

Yes

@PoignardAzur

I would like to produce a k-quantized version of some files (the distil-whisper models, but the question applies to any LLM). Is there a reference implementation of the quantizer somewhere?

I could look at the ggml-quants.c contents and work it out from there, but even then I'd like a reference implementation to compare against.

@qwerr0

qwerr0 commented Jan 16, 2024

Amazing work.

@francoisfleuret

This is best explained with the following graph, which shows perplexity on the wikitext dataset as a function of model size:

This graph is gorgeous. Any hope to have the raw numbers to replot it?

@mofosyne added the labels "Tensor Encoding Scheme" (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) and "Review Complexity: High" (generally requires in-depth knowledge of LLMs or GPUs) on May 25, 2024
@Seedmanc

So, as a bottom line, is an L quant of a lower bit depth better or worse than an S quant of a higher one? Like Q4 L vs. Q5 S.
Can file size always be used as a measure of a quant's quality when given a list of various bit depths and quant letters? Some simple guidelines for choosing which model version to download would be nice; this diversity of versions is a lot to handle.
