
Pulling new quantization format Q4_1_O into upstream ggml #89

Closed
saharNooby opened this issue Apr 17, 2023 · 17 comments

Comments

@saharNooby

saharNooby commented Apr 17, 2023

When developing rwkv.cpp, I discovered that the existing quantization formats Q4_0 and Q4_1 break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. It looks like this is caused both by outliers in the weights and by outliers in the activations.

To solve this, I've created a new format Q4_1_O. Commit in rwkv.cpp. Comparisons.

Most important things about the format:

  • it is based on Q4_1
  • it stores the min & delta values in FP16, not FP32
  • per 32-element block, it losslessly stores a single absmax FP16 value (called the "outlier") and its index in the block; all other values are quantized as if there were no outlier
  • matmul is done in FP32, that is, I dequantize the matrix and multiply it by the activations in FP32
  • per-token latency is the same as FP32 (40% slower than FP16) on my machine
  • perplexity is, as expected with any quantization, slightly higher than FP16, but the principle "it's better to use a quantized X+1 model than an FP16 X model" holds

TL;DR: store single outlier value per block unquantized; dot in FP32.
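For illustration, here is a minimal sketch of what a Q4_1_O block could look like, based on the description above (field names, types and layout are illustrative, not the exact rwkv.cpp definitions):

#include <stdint.h>
#include "ggml.h" // for ggml_fp16_t

#define QK 32 // elements per quantization block

// Illustrative Q4_1_O block: Q4_1 plus one losslessly stored outlier.
typedef struct {
    ggml_fp16_t d;             // delta (scale), FP16
    ggml_fp16_t m;             // min, FP16
    ggml_fp16_t outlier_value; // absmax element of the block, kept unquantized
    uint16_t    outlier_index; // position of the outlier within the block (0..31)
    uint8_t     qs[QK / 2];    // 4-bit quants of the remaining values, two per byte
} block_q4_1_o;

// At matmul time the block is dequantized to FP32 (writing outlier_value back at
// outlier_index), and the dot product with the activations is done in FP32.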


Recently, it became clear that my ggml fork and upstream ggml (in llama.cpp/here) have begun to diverge significantly: the code difference between ggml and rwkv.cpp keeps growing.

I would like to keep the interventions in my copy of ggml as small as possible, so I can pull the latest optimizations/fixes without having to reapply all my changes.

Specifically, I ask: does it sound like the Q4_1_O format belongs in upstream ggml? If so, I can create a PR here.

@ggerganov
Owner

matmul is done in FP32, that is, I dequantize the matrix and multiply it by the activations in FP32

Do you have an estimate of how much that change affected the perplexity in isolation from the Q4_1_0 format?

We are doing some active work in llama.cpp on improving the quantization accuracy of the existing formats. One of the recent conclusions was that doing the mat mul in 8-bit is almost equivalent to doing it in 32-bit in terms of quality, and as fast as doing it in 4-bit (the approach used when you made the fork). Additionally, doing the mat mul in 32-bit significantly improves the perplexity compared to the original 4-bit mat mul.
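For reference, the 8-bit intermediate step is roughly the following: right before the dot product, each 32-element block of activations is quantized to int8 with a single per-block scale, so the 4-bit weights are multiplied against 8-bit activations instead of re-quantized 4-bit ones. A simplified sketch (not the actual llama.cpp kernel):

#include <math.h>
#include <stdint.h>

#define QK 32

// Simplified sketch: quantize one 32-element block of activations to int8
// with a single per-block FP32 scale (Q8-style); not the actual llama.cpp code.
static void quantize_block_q8(const float * x, int8_t * q, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) {
            amax = v;
        }
    }
    *scale = amax / 127.0f;
    const float id = amax != 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < QK; i++) {
        q[i] = (int8_t) roundf(x[i] * id); // values fit in [-127, 127]
    }
}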

All this is upstream in ggml - not sure how hard it would be for you to give it a try.

There are additional improvements pending to the Q4_0 and potentially Q4_1 format that I am hoping will make the accuracy between quantized and non-quantized models even more similar.

And an additional idea for keeping one of the tensors in full-precision is also likely to be added. But this is not related to ggml directly - rather to the transformer. Not sure if it is applicable to RWKV. My initial experiments show that doing this is currently enough to make the smallest GPT-2 117M work with quantization, instead of breaking down (haven't measured perplexity though).

With that said, I am hoping, after all this work is done, to try and see if RWKV still breaks down with the existing formats.
I understand the weight distribution in RWKV has outliers and this will likely still not work, but I think it is worth giving it a try before adding Q4_1_0 to the code base. Adding it would increase development and maintenance efforts.

There might also be an option of replacing Q4_1 with Q4_1_0, but I cannot say for sure yet.

@saharNooby
Author

Do you have an estimate of how much that change affected the perplexity in isolation from the Q4_1_0 format?

As a rough estimate, here is RWKV 169M ppl on a small, private dataset:

Vanilla Q4_1:                           perplexity  23.642
FP32 dot:                               perplexity  18.402
With outliers preservation, FP32 dot:   perplexity  16.231
FP16:                                   perplexity  14.861

FP32 dot greatly reduces perplexity, but preserving outlier weights is important too.

And an additional idea for keeping one of the tensors in full-precision is also likely to be added

This is relevant for RWKV and has already been applied -- we decided not to quantize either emb or head (the unembedding matrix), because they are not that big compared to the rest of the model, and quantizing them is not worth the reduction in quality.
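Concretely, this just means skipping those tensors by name during quantization; a sketch (the tensor names and the helper are illustrative, not necessarily the exact ones used in rwkv.cpp):

#include <stdbool.h>
#include <string.h>

// Sketch: decide whether a parameter tensor should be quantized.
// emb/head stay in FP16/FP32; names here are illustrative.
static bool should_quantize_tensor(const char * name, int n_dims) {
    if (n_dims != 2) {
        return false; // only 2D weight matrices are quantized
    }
    if (strcmp(name, "emb.weight") == 0 || strcmp(name, "head.weight") == 0) {
        return false; // keep embedding and unembedding matrices unquantized
    }
    return true;
}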

but I think it is worth giving it a try before adding Q4_1_0 to the code base

I understand, sounds reasonable.

There's work to do even before attempting to pull Q4_1_O: I need to depend on a submodule instead of copied files, and the exp, sigmoid and max (maybe also 1_minus_x) operators need to be pulled into upstream. After this is done, I can compare Q4_1_O with all the other formats, including the latest improvements.

(BTW, it's Q4_1_O ("oh", for "outliers"), not Q4_1_0 (zero))

@ggerganov
Owner

We now support 1D and 2D custom mapping operators in ggml:

ggml/include/ggml/ggml.h

Lines 660 to 675 in cc91df0

// Mapping operations
typedef void (*ggml_unary_op_f32_t) (const int, float *, const float *);
typedef void (*ggml_binary_op_f32_t)(const int, float *, const float *, const float *);

struct ggml_tensor * ggml_map_unary_f32(
        struct ggml_context       * ctx,
        struct ggml_tensor        * a,
        const ggml_unary_op_f32_t   fun);

struct ggml_tensor * ggml_map_binary_f32(
        struct ggml_context        * ctx,
        struct ggml_tensor         * a,
        struct ggml_tensor         * b,
        const ggml_binary_op_f32_t   fun);

One option is to implement exp, sigmoid, etc. with those mapping operators and this way keep their implementation inside rwkv.cpp. Of course, we can also just add them as operators in ggml. It's a matter of balance - in the early development stage it's better to have a smaller codebase to deal with, so we can iterate more quickly. After we know the common use cases, we can make appropriate refactorings and expand the functionality further.
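For example, a sigmoid could be kept entirely inside rwkv.cpp like this (a sketch built on the ggml_map_unary_f32 signature above; the function names are illustrative):

#include <math.h>
#include "ggml.h"

// Sketch: element-wise sigmoid implemented as a custom unary mapping,
// so the operator lives in rwkv.cpp rather than in ggml itself.
static void rwkv_sigmoid_impl(const int n, float * dst, const float * src) {
    for (int i = 0; i < n; i++) {
        dst[i] = 1.0f / (1.0f + expf(-src[i]));
    }
}

// usage: struct ggml_tensor * y = ggml_map_unary_f32(ctx, x, rwkv_sigmoid_impl);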

These results are very useful.
My expectation is that after ggerganov/llama.cpp#995, the new Q4_0 with 2x F16 factors will give you ~18.4 ppl on that test. I think we should try this out when it is ready, to confirm that all these analyses are reasonable and decide how to proceed.

@saharNooby
Author

One option is to implement exp, sigmoid, etc. with those mapping operators

Already did it, and it works flawlessly! Thanks @KerfuffleV2. It really saved me time.

I will now focus on porting Q4_1_O to my up-to-date fork of ggml.

to confirm that all these analyses are reasonable

I'll try to document exactly what setup I use for the perplexity measurements, so they are reproducible. Unfortunately, I don't want to run wikitext perplexity tests because they take days, so I do much smaller tests on a single file of ~4K tokens. Not ideal, but I believe it is still representative and good enough.

@saharNooby
Author

I've updated ggml in rwkv.cpp to the latest version and did more rigorous measurements.

Measuring set-up.

Perplexity for RWKV 169M:

Data type   Perplexity
Q4_0        18.599
Q4_1        19.389
Q4_1_O      16.700
FP16        15.623
FP32        15.623

Per-token latency for RWKV 1B5 on Windows with AVX2:

Data type   Latency, ms
Q4_0         59
Q4_1        102
Q4_1_O      144
FP16        116
FP32        207

Interestingly, Q4_0 is now better for perplexity than Q4_1 -- this was not the case before. And it's very fast.

Q4_1_O, for some reason, became faster after the update, and is now only somewhat slower than FP16. It still has the best perplexity among the quantized formats.


I'll wait until "Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors" is done, then pull the changes and redo the measurements.

@ggerganov
Owner

Q4_1 is slower because it is not yet using the intermediate 8-bit quantization like Q4_0 does (ggerganov/llama.cpp#951). It is still quantizing down to 4 bits for the mat mul, which we know is not accurate. The implementation is postponed until we figure out whether Q4_0 can become even better.

@saharNooby
Author

saharNooby commented Apr 18, 2023

@ggerganov Hi! I have a small question about ggml. When calculating work_size for a matmul node, do I need to multiply the size by the thread count? I need a per-thread temporary buffer of elements_in_vector * sizeof(float) bytes, and the buffers must not overlap between threads. Commit with context
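In other words, something like the following (a sketch; I'm assuming the usual ggml pattern where the internal ggml_compute_params carries ith/nth and a shared wdata buffer):

#include <stddef.h>

// Sketch of the intended layout:
//
// when planning the work size for the matmul node:
//     work_size = n_threads * elements_in_vector * sizeof(float);
//
// inside the op, each thread takes its own non-overlapping slice of the shared
// buffer (wdata would be params->wdata, ith would be params->ith in ggml):
static float * thread_scratch(void * wdata, int ith, int elements_in_vector) {
    return (float *) wdata + (size_t) ith * elements_in_vector;
}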

BTW, Q4_1_O was just optimized again and is now as fast as FP16, yay! Edit: actually it's 20% slower than FP16, but still 10% faster on the 7B model than the previous implementation.

@ggerganov
Owner

@saharNooby

If it is not too much work for you, could you run the perplexity / speed tests on the new Q4_2 and Q4_3 quantization formats? It would be interesting to see how things improve for RWKV. Also, Q4_1 should be faster now on the latest master.

@saharNooby
Author

saharNooby commented Apr 22, 2023

ggml version. Measuring set-up.

Perplexity for RWKV 169M:

Data type   Before   After
Q4_0        18.599   18.599
Q4_1        19.389   17.187
Q4_1_O      16.700   16.700
Q4_2        N/A      17.060
Q4_3        N/A      16.850
FP16        15.623   15.623
FP32        15.623   15.623

Per-token latency for RWKV 1B5 on Windows with AVX2:

Data type   Before, ms   After, ms
Q4_0         59           64
Q4_1        102           71
Q4_1_O      144          141
Q4_2        N/A           85
Q4_3        N/A           95
FP16        116          117
FP32        207          198

Q4_1 became better and faster.

Q4_1_O still has the best perplexity among quantized formats, but considering its speed, it's not clear whether it would be more useful than Q4_3. Q4_3 is very close in perplexity.

I need to do more testing to decide whether Q4_1_O is no longer needed and Q4_3 should be used instead.

@ggerganov
Owner

@saharNooby

There are now new Q5_0 and Q5_1 quantization methods available.
See our evaluation for the LLaMA model:

https://github.com/ggerganov/llama.cpp#quantization

Would be interesting to see how they perform with RWKV

@saharNooby
Author

@ggerganov Great! I guess I'll need to test Q8_0 too for the full picture.

Do you plan to add more quantization formats in the near future?

@KerfuffleV2

A day without finding a new quantization format just means you forgot to pull the repo.

(I actually love the rapid progress and iteration, so don't take that as any kind of complaint.)

@ggerganov
Owner

I think for the near future we will support these formats.
Still need to evaluate how well they perform and if we really need all of them.

The Q8_0 format is currently optimized only for ARM NEON, so if you are on x86 it will be very slow.
Will soon add AVX optimization.

@saharNooby
Author

Tested Q5 and Q8 formats with the same settings:

Format   Perplexity (169M)   Latency, ms (1.5B)   File size, GB (1.5B)
Q4_0     17.507               76                  1.53
Q4_1     17.187               72                  1.68
Q4_1_O   16.700              141                  1.68
Q4_2     17.060               85                  1.53
Q5_0     16.194               78                  1.60
Q5_1     15.851               81                  1.68
Q8_0     15.652               89                  2.13
FP16     15.623              117                  2.82
FP32     15.623              198                  5.64
  • Perplexity of Q4_0 became lower. Not sure why, maybe better quantization, maybe a better matmul :)
  • Q5 is great -- low perplexity, fast, and the file size is comparable to Q4.
  • Q8, as expected, has the lowest perplexity among the quantized formats, but the file size is almost as big as FP16 and inference is on the slower side; I guess there is not much need to use it (unless it fits in RAM and FP16 does not, which may be the case for some users).

I decided to remove the Q4_1_O format, since it is worse in perplexity, latency and file size than the other formats (mostly talking about Q5). It's a little sad -- I learned a lot while developing it; but this removal will greatly simplify ggml updates.


BTW, the only reason I still need to fork ggml is a couple of build issues: I needed to add OBJECT to CMakeLists.txt and remove dllimport/dllexport.

@ggerganov If it's not too much work and you have time, could you check the changes and maybe comment on how I can resolve the build issues for which I need these changes? (I can also open a separate issue if that would be more efficient.)

@saharNooby closed this as not planned on Apr 29, 2023
@ggerganov
Owner

@saharNooby

Thank you for the information!

I don't have a Windows machine to test on, but I think I have fixed the build issues you are experiencing.
Try updating ggml to the latest master: b237714

@saharNooby
Author

saharNooby commented Apr 29, 2023

@ggerganov It works, thanks!

Along with rwkv.cpp, it also generates ggml.dll, which is not needed. But I will create a custom option RWKV_BUILD_SHARED_LIBRARY instead of the built-in BUILD_SHARED_LIBS to prevent it. After this, I can finally get rid of the ggml fork and just use your main repo.

@ggerganov
Owner

Yes, you probably want to build ggml as a static library. Should be simpler.

Awesome work on rwkv.cpp!
