
Pulling new quantization format Q4_1_O into upstream ggml #89

Closed
saharNooby opened this issue Apr 17, 2023 · 17 comments

Comments

@saharNooby

saharNooby commented Apr 17, 2023

When developing rwkv.cpp, I discovered that the existing quantization formats Q4_0 and Q4_1 break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. It looks like this is caused both by outliers in the weights and by outliers in the activations.

To solve this, I've created a new format Q4_1_O. Commit in rwkv.cpp. Comparisons.

Most important things about the format:

  • it is based on Q4_1
  • it stores the min & delta values in FP16, not FP32
  • per 32-element block, it losslessly stores a single absmax FP16 value (called the "outlier") and its index in the block; all other values are quantized as if there were no outlier
  • matmul is done in FP32, that is, I dequantize the matrix and multiply it by the activations in FP32
  • per-token latency is the same as FP32 (40% slower than FP16) on my machine
  • perplexity is, as expected with any quantization, slightly higher than FP16, but the principle "it's better to use a quantized X+1 model than an FP16 X model" holds

TL;DR: store single outlier value per block unquantized; dot in FP32.
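For illustration, here is a minimal sketch of what a Q4_1_O block could look like, based on the description above (field names, types and layout are illustrative, not the exact rwkv.cpp definitions):

#include <stdint.h>
#include "ggml.h" // for ggml_fp16_t

#define QK 32 // elements per quantization block

// Illustrative Q4_1_O block: Q4_1 plus one losslessly stored outlier.
typedef struct {
    ggml_fp16_t d;             // delta (scale), FP16
    ggml_fp16_t m;             // min, FP16
    ggml_fp16_t outlier_value; // absmax element of the block, kept unquantized
    uint16_t    outlier_index; // position of the outlier within the block (0..31)
    uint8_t     qs[QK / 2];    // 4-bit quants of the remaining values, two per byte
} block_q4_1_o;

// At matmul time the block is dequantized to FP32 (writing outlier_value back at
// outlier_index), and the dot product with the activations is done in FP32.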


Recently, it became clear that my ggml fork and upstream ggml (in llama.cpp/here) have begun to diverge significantly: the code difference between ggml and rwkv.cpp keeps growing.

I would like to keep the interventions in my copy of ggml as small as possible, so I can pull the latest optimizations/fixes without having to reapply all my changes.

Specifically, I ask: does it sound like the Q4_1_O format belongs in upstream ggml? If so, I can create a PR here.

@ggerganov
Owner

matmul is done in FP32, that is, I dequantize the matrix and multiply it by the activations in FP32

Do you have an estimate of how much that change affected the perplexity in isolation from the Q4_1_0 format?

We are doing some active work in llama.cpp on improving the quantization accuracy of the existing formats. One of the recent conclusions was that doing the mat mul in 8-bit is almost equivalent to doing it in 32-bit in terms of quality, and as fast as doing it in 4-bit (the approach used when you made the fork). Additionally, doing the mat mul in 32-bit significantly improves the perplexity compared to the original 4-bit mat mul.
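For reference, the 8-bit intermediate step is roughly the following: right before the dot product, each 32-element block of activations is quantized to int8 with a single per-block scale, so the 4-bit weights are multiplied against 8-bit activations instead of re-quantized 4-bit ones. A simplified sketch (not the actual llama.cpp kernel):

#include <math.h>
#include <stdint.h>

#define QK 32

// Simplified sketch: quantize one 32-element block of activations to int8
// with a single per-block FP32 scale (Q8-style); not the actual llama.cpp code.
static void quantize_block_q8(const float * x, int8_t * q, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) {
            amax = v;
        }
    }
    *scale = amax / 127.0f;
    const float id = amax != 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < QK; i++) {
        q[i] = (int8_t) roundf(x[i] * id); // values fit in [-127, 127]
    }
}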

All this is upstream in ggml - not sure how hard it would be for you to give it a try.

There are additional improvements pending to the Q4_0 and potentially Q4_1 format that I am hoping will make the accuracy between quantized and non-quantized models even more similar.

And an additional idea for keeping one of the tensors in full-precision is also likely to be added. But this is not related to ggml directly - rather to the transformer. Not sure if it is applicable to RWKV. My initial experiments show that doing this is currently enough to make the smallest GPT-2 117M work with quantization, instead of breaking down (haven't measured perplexity though).

With that said, I am hoping, after all this work is done, to try and see if RWKV still breaks down with the existing formats.
I understand the weight distribution in RWKV has outliers and this will likely still not work, but I think it is worth giving it a try before adding Q4_1_0 to the code base. Adding it would increase development and maintenance efforts.

There might also be an option of replacing Q4_1 with Q4_1_0, but I cannot say for sure yet.

@saharNooby
Author

Do you have an estimate of how much that change affected the perplexity in isolation from the Q4_1_0 format?

As a rough estimate, here is RWKV 169M ppl on a small, private dataset:

Vanilla Q4_1:                           perplexity  23.642
FP32 dot:                               perplexity  18.402
With outliers preservation, FP32 dot:   perplexity  16.231
FP16:                                   perplexity  14.861

FP32 dot greatly reduces perplexity, but preserving outlier weights is important too.

And an additional idea for keeping one of the tensors in full-precision is also likely to be added

This is relevant for RWKV and has already been applied -- we decided not to quantize either emb or head (the unembedding matrix), because they are not that big compared to the rest of the model, and quantizing them is not worth the reduction in quality.
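Concretely, this just means skipping those tensors by name during quantization; a sketch (the tensor names and the helper are illustrative, not necessarily the exact ones used in rwkv.cpp):

#include <stdbool.h>
#include <string.h>

// Sketch: decide whether a parameter tensor should be quantized.
// emb/head stay in FP16/FP32; names here are illustrative.
static bool should_quantize_tensor(const char * name, int n_dims) {
    if (n_dims != 2) {
        return false; // only 2D weight matrices are quantized
    }
    if (strcmp(name, "emb.weight") == 0 || strcmp(name, "head.weight") == 0) {
        return false; // keep embedding and unembedding matrices unquantized
    }
    return true;
}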

but I think it is worth giving it a try before adding Q4_1_0 to the code base

I understand, sounds reasonable.

There's work to do even before attempting to pull Q4_1_O: I need to depend on a submodule instead of copied files, and the exp, sigmoid and max (maybe also 1_minus_x) operators need to be pulled into upstream. After this is done, I can compare Q4_1_O with all the other formats, including the latest improvements.

(BTW, it's Q4_1_O ("oh", for "outliers"), not Q4_1_0 (zero))

@ggerganov
Owner

We now support 1D and 2D custom mapping operators in ggml:

ggml/include/ggml/ggml.h

Lines 660 to 675 in cc91df0

// Mapping operations
typedef void (*ggml_unary_op_f32_t) (const int, float *, const float *);
typedef void (*ggml_binary_op_f32_t)(const int, float *, const float *, const float *);

struct ggml_tensor * ggml_map_unary_f32(
        struct ggml_context       * ctx,
        struct ggml_tensor        * a,
        const ggml_unary_op_f32_t   fun);

struct ggml_tensor * ggml_map_binary_f32(
        struct ggml_context        * ctx,
        struct ggml_tensor         * a,
        struct ggml_tensor         * b,
        const ggml_binary_op_f32_t   fun);

One option is to implement exp, sigmoid, etc. with those mapping operators and this way keep their implementation inside rwkv.cpp. Of course, we can also just add them as operators in ggml. It's a matter of balance - in the early development stage it's better to have a smaller codebase to deal with, so we can iterate more quickly. After we know the common use cases, we can make appropriate refactorings and expand the functionality further.
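For example, a sigmoid could be kept entirely inside rwkv.cpp like this (a sketch built on the ggml_map_unary_f32 signature above; the function names are illustrative):

#include <math.h>
#include "ggml.h"

// Sketch: element-wise sigmoid implemented as a custom unary mapping,
// so the operator lives in rwkv.cpp rather than in ggml itself.
static void rwkv_sigmoid_impl(const int n, float * dst, const float * src) {
    for (int i = 0; i < n; i++) {
        dst[i] = 1.0f / (1.0f + expf(-src[i]));
    }
}

// usage: struct ggml_tensor * y = ggml_map_unary_f32(ctx, x, rwkv_sigmoid_impl);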

These results are very useful.
My expectation is that after ggerganov/llama.cpp#995, the new Q4_0 with 2x F16 factors will give you ~18.4 ppl on that test. I think we should try this out when it is ready, to confirm that all these analyses are reasonable and decide how to proceed.

@saharNooby
Author

One option is to implement exp, sigmoid, etc. with those mapping operators

Already did it, and it works flawlessly! Thanks @KerfuffleV2. It really saved me time.

I will now focus on porting Q4_1_O to my up-to-date fork of ggml.

to confirm that all these analyses are reasonable

I'll try to document exactly what setup I use for the perplexity measurements, so they are reproducible. Unfortunately, I don't want to run wikitext perplexity tests because they take days, so I do much smaller tests on a single file of ~4K tokens. Not ideal, but I believe it is still representative and good enough.

@saharNooby
Author

I've updated ggml in rwkv.cpp to the latest version and did more rigorous measurements.

Measuring set-up.

Perplexity for RWKV 169M:

Data type   Perplexity
Q4_0        18.599
Q4_1        19.389
Q4_1_O      16.700
FP16        15.623
FP32        15.623

Per-token latency for RWKV 1B5 on Windows with AVX2:

Data type   Latency, ms
Q4_0         59
Q4_1        102
Q4_1_O      144
FP16        116
FP32        207

Interestingly, Q4_0 is now better for perplexity than Q4_1 -- this was not the case before. And it's very fast.

Q4_1_O, for some reason, became faster after the update, and is now only somewhat slower than FP16. It still has the best perplexity among the quantized formats.


I'll wait until "Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors" is done, then pull the changes and redo the measurements.

@ggerganov
Owner

Q4_1 is slower because it is not yet using the intermediate 8-bit quantization like Q4_0 does (ggerganov/llama.cpp#951). It is still quantizing down to 4 bits for the mat mul, which we know is not accurate. The implementation is postponed until we figure out whether Q4_0 can become even better.

@saharNooby
Author

saharNooby commented Apr 18, 2023

@ggerganov Hi! I have a small question about ggml. When calculating work_size for a matmul node, do I need to multiply the size by the thread count? I need a per-thread temporary buffer of elements_in_vector * sizeof(float) bytes, and the buffers must not overlap between threads. Commit with context
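In other words, something like the following (a sketch; I'm assuming the usual ggml pattern where the internal ggml_compute_params carries ith/nth and a shared wdata buffer):

#include <stddef.h>

// Sketch of the intended layout:
//
// when planning the work size for the matmul node:
//     work_size = n_threads * elements_in_vector * sizeof(float);
//
// inside the op, each thread takes its own non-overlapping slice of the shared
// buffer (wdata would be params->wdata, ith would be params->ith in ggml):
static float * thread_scratch(void * wdata, int ith, int elements_in_vector) {
    return (float *) wdata + (size_t) ith * elements_in_vector;
}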

BTW, Q4_1_O was just optimized again and is now as fast as FP16, yay! Edit: actually it's 20% slower than FP16, but still 10% faster on the 7B model than the previous implementation.

@ggerganov
Owner

@saharNooby

If it is not too much work for you, could you run the perplexity / speed tests on the new Q4_2 and Q4_3 quantization formats? It would be interesting to see how things improve for RWKV. Also, Q4_1 should be faster now on the latest master.

@saharNooby
Author

saharNooby commented Apr 22, 2023

ggml version. Measuring set-up.

Perplexity for RWKV 169M:

Data type   Before   After
Q4_0        18.599   18.599
Q4_1        19.389   17.187
Q4_1_O      16.700   16.700
Q4_2        N/A      17.060
Q4_3        N/A      16.850
FP16        15.623   15.623
FP32        15.623   15.623

Per-token latency for RWKV 1B5 on Windows with AVX2:

Data type   Before, ms   After, ms
Q4_0         59           64
Q4_1        102           71
Q4_1_O      144          141
Q4_2        N/A           85
Q4_3        N/A           95
FP16        116          117
FP32        207          198

Q4_1 became better and faster.

Q4_1_O still has the best perplexity among quantized formats, but considering its speed, it's not clear whether it would be more useful than Q4_3. Q4_3 is very close in perplexity.

I need to do more testing to decide whether Q4_1_O is no longer needed and Q4_3 should be used instead.

@ggerganov
Owner

@saharNooby

There are now new Q5_0 and Q5_1 quantization methods available.
See our evaluation for the LLaMA model:

https://github.com/ggerganov/llama.cpp#quantization

Would be interesting to see how they perform with RWKV

@saharNooby
Author

@ggerganov Great! I guess I'll need to test Q8_0 too for the full picture.

Do you plan to add more quantization formats in the near future?

@KerfuffleV2

A day without finding a new quantization format just means you forgot to pull the repo.

(I actually love the rapid progress and iteration, so don't take that as any kind of complaint.)

@ggerganov
Owner

I think for the near future we will support these formats.
Still need to evaluate how well they perform and if we really need all of them.

The Q8_0 format is currently optimized only for ARM NEON, so if you are on x86 it will be very slow.
Will soon add AVX optimization.

@saharNooby
Author

Tested Q5 and Q8 formats with the same settings:

Format   Perplexity (169M)   Latency, ms (1.5B)   File size, GB (1.5B)
Q4_0     17.507               76                  1.53
Q4_1     17.187               72                  1.68
Q4_1_O   16.700              141                  1.68
Q4_2     17.060               85                  1.53
Q5_0     16.194               78                  1.60
Q5_1     15.851               81                  1.68
Q8_0     15.652               89                  2.13
FP16     15.623              117                  2.82
FP32     15.623              198                  5.64
  • Perplexity of Q4_0 became lower. Not sure why, maybe better quantization, maybe a better matmul :)
  • Q5 is great -- low perplexity, fast, and the file size is comparable to Q4.
  • Q8, as expected, has the lowest perplexity among the quantized formats, but the file size is almost as big as FP16 and inference is on the slower side; I guess there is not much need to use it (unless it fits in RAM and FP16 does not, which may be the case for some users).

I decided to remove the Q4_1_O format, since it is worse in perplexity, latency and file size than the other formats (mostly talking about Q5). It's a little sad -- I learned a lot while developing it; but this removal will greatly simplify ggml updates.


BTW, the only reason I still need to fork ggml is a couple of build issues: I needed to add OBJECT to CMakeLists.txt and remove dllimport/dllexport.

@ggerganov If it's not too much work and you have time, could you check the changes and maybe comment on how I can resolve the build issues for which I need these changes? (I can also open a separate issue if that would be more efficient.)

@saharNooby closed this as not planned on Apr 29, 2023
@ggerganov
Owner

@saharNooby

Thank you for the information!

I don't have a Windows machine to test on, but I think I have fixed the build issues you are experiencing.
Try updating ggml to the latest master: b237714

@saharNooby
Author

saharNooby commented Apr 29, 2023

@ggerganov It works, thanks!

Along with rwkv.cpp, it also generates ggml.dll, which is not needed. But I will create a custom option RWKV_BUILD_SHARED_LIBRARY instead of the built-in BUILD_SHARED_LIBS to prevent it. After this, I can finally get rid of the ggml fork and just use your main repo.

@ggerganov
Owner

Yes, you probably want to build ggml as a static library. Should be simpler.

Awesome work on rwkv.cpp!
