
Q4_0 and Q4_1 quantization breaks RWKV due to weight/activation outliers #12

Closed

saharNooby opened this issue Apr 3, 2023 · 7 comments

saharNooby (Collaborator) commented Apr 3, 2023

I've been measuring loss and perplexity for different model sizes and data types on a very small private dataset:

rwkv.cpp-169M-Q4_0.bin                      averages: loss [3.629], perplexity  37.691
rwkv.cpp-169M-Q4_1.bin,                     averages: loss [3.163], perplexity  23.642
rwkv.cpp-169M-float16.bin,                  averages: loss [2.699], perplexity  14.861
rwkv.cpp-169M.bin,                          averages: loss [2.699], perplexity  14.861

RWKV-4-Pile-430M-20220808-8066-q4_0.bin,    averages: loss [2.911], perplexity  18.375
RWKV-4-Pile-430M-20220808-8066-q4_1.bin,    averages: loss [2.631], perplexity  13.885
RWKV-4-Pile-430M-20220808-8066-FP16.bin,    averages: loss [2.377], perplexity  10.777
RWKV-4-Pile-430M-20220808-8066-FP32.bin,    averages: loss [2.377], perplexity  10.777

RWKV-4-Pile-1B5-20220929-ctx4096-Q4_0.bin,  averages: loss [3.079], perplexity  21.745
RWKV-4-Pile-1B5-20220929-ctx4096-Q4_1.bin,  averages: loss [2.655], perplexity  14.231
RWKV-4-Pile-1B5-20220929-ctx4096-FP16.bin,  averages: loss [2.060], perplexity   7.847
RWKV-4-Pile-1B5-20220929-ctx4096-FP32.bin,  averages: loss [2.060], perplexity   7.847

RWKV-4-Pile-3B-20221110-ctx4096-Q4_0.bin,   averages: loss [4.689], perplexity 108.724
RWKV-4-Pile-3B-20221110-ctx4096-Q4_1.bin,   averages: loss [2.916], perplexity  18.475
RWKV-4-Pile-3B-20221110-ctx4096-FP16.bin,   averages: loss [2.067], perplexity   7.901

RWKV-4-Pile-7B-20230109-ctx4096-Q4_0.bin,   averages: loss [6.296], perplexity 542.322
RWKV-4-Pile-7B-20230109-ctx4096-Q4_1.bin,   averages: loss [3.017], perplexity  20.423

The measuring method may not be entirely correct, but these huge losses and perplexities really do show in the quality of generated text -- it is almost incoherent.

Of course, we need proper measurement on WikiText; but it would be very slow on my hardware, and WikiText is not representative of my use case.
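For reference, the perplexity values above follow perplexity = exp(average per-token loss). A minimal sketch of that relationship (not the actual measuring script):

```python
# Minimal sketch: perplexity is exp of the average per-token cross-entropy loss.
import math

def perplexity(per_token_losses):
    return math.exp(sum(per_token_losses) / len(per_token_losses))

# e.g. an average loss of 2.699 gives exp(2.699) ~= 14.86,
# matching the FP16/FP32 rows above.
print(perplexity([2.699]))
```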

An interesting thing to note is the min and max values of RWKV matrix weights:

169M: -13.8750 14.0000
430M: -14.5000 14.9375
1.5B: -27.2500 27.3750
3B: -12.6875 14.1250

For comparison, LLaMA 7B min and max values are around -2.5 and 2.5!

As a next step, I'll try to determine whether these huge values are outliers, or most weights really are distributed in this range.

I guess we need an alternative quantization scheme for RWKV.
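A minimal PyTorch sketch of this kind of range check (the checkpoint path and the dim == 2 filter are assumptions, not the exact script used):

```python
# Sketch: scan a PyTorch RWKV checkpoint for the global min/max of matrix weights.
import torch

state_dict = torch.load("RWKV-4-Pile-3B-20221110-ctx4096.pth", map_location="cpu")

global_min, global_max = float("inf"), float("-inf")
for name, tensor in state_dict.items():
    if tensor.dim() != 2:
        continue  # only matrix weights get quantized
    t = tensor.float()
    global_min = min(global_min, t.min().item())
    global_max = max(global_max, t.max().item())

print(f"{global_min:.4f} {global_max:.4f}")
```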

saharNooby (Collaborator Author) commented:

If we look at the percentiles of the 3B model, the 0.001-0.999 percentiles have modest values:

blocks.22.att.key.weight [2048, 2048]
	| 0.001=-0.9727 | 0.01=-0.6680 | 0.05=-0.4414 | 0.10=-0.3340 | 0.50=-0.0010 | 0.90=0.3320 | 0.95=0.4395 | 0.99=0.6641 | 0.999=0.9961
blocks.22.att.value.weight [2048, 2048]
	| 0.001=-1.4922 | 0.01=-1.0234 | 0.05=-0.6836 | 0.10=-0.5195 | 0.50=0.0003 | 0.90=0.5195 | 0.95=0.6836 | 0.99=1.0234 | 0.999=1.4922
blocks.22.att.receptance.weight [2048, 2048]
	| 0.001=-0.7812 | 0.01=-0.5469 | 0.05=-0.3672 | 0.10=-0.2773 | 0.50=0.0046 | 0.90=0.2871 | 0.95=0.3750 | 0.99=0.5508 | 0.999=0.7812
blocks.22.att.output.weight [2048, 2048]
	| 0.001=-1.2500 | 0.01=-0.8789 | 0.05=-0.6016 | 0.10=-0.4609 | 0.50=0.0001 | 0.90=0.4609 | 0.95=0.6016 | 0.99=0.8789 | 0.999=1.2500
blocks.22.ffn.key.weight [8192, 2048]
	| 0.001=-0.9805 | 0.01=-0.7227 | 0.05=-0.5039 | 0.10=-0.3906 | 0.50=0.0005 | 0.90=0.3906 | 0.95=0.5039 | 0.99=0.7227 | 0.999=0.9844
blocks.22.ffn.receptance.weight [2048, 2048]
	| 0.001=-0.8047 | 0.01=-0.6016 | 0.05=-0.4238 | 0.10=-0.3281 | 0.50=-0.0008 | 0.90=0.3262 | 0.95=0.4199 | 0.99=0.5977 | 0.999=0.8008
blocks.22.ffn.value.weight [2048, 8192]
	| 0.001=-0.9688 | 0.01=-0.7109 | 0.05=-0.4961 | 0.10=-0.3848 | 0.50=0.0001 | 0.90=0.3867 | 0.95=0.4980 | 0.99=0.7148 | 0.999=0.9727

And the minimum 0.001 percentile and maximum 0.999 percentile among all matrices are -1.5312 and 1.5312, respectively.

Looks like the huge values are rare outliers, and most other values are in the range -1.53 .. 1.53.
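A sketch of how such a percentile dump can be produced (the checkpoint path and the dim == 2 filter are again assumptions, not the exact script used):

```python
# Sketch: per-tensor percentiles of matrix weights, similar to the dump above.
# numpy.percentile expects percentages, so the 0.001 quantile is 0.1.
import numpy as np
import torch

state_dict = torch.load("RWKV-4-Pile-3B-20221110-ctx4096.pth", map_location="cpu")
quantiles = [0.001, 0.01, 0.05, 0.10, 0.50, 0.90, 0.95, 0.99, 0.999]

for name, tensor in state_dict.items():
    if tensor.dim() != 2:
        continue
    values = np.percentile(tensor.float().numpy().ravel(), [q * 100 for q in quantiles])
    row = " | ".join(f"{q}={v:.4f}" for q, v in zip(quantiles, values))
    print(f"{name} {list(tensor.shape)}\n\t| {row}")
```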

saharNooby (Collaborator Author) commented Apr 3, 2023

Some links that may be useful for research/impl:

saharNooby (Collaborator Author) commented Apr 4, 2023

Here is a hacky way to deal with outlier weights/activations: a commit in an experimental branch (do not use it unless you know what you are doing!)

What I did:

  • stored outliers in the Q4_1 block explicitly: an outlier is simply the single absmax value in a block; it is not quantized and is stored as-is
  • deoptimized the Q4_1 matmul: it used to be quantize activations -> quantized dot; now it is dequantize weight row -> FP32 dot (a rough sketch of both changes follows after this list)
  • disabled quantization of emb.weight -- for the 14B model, we would save 305 MB by quantizing it from FP16; I guess it is not worth the quality decrease
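A rough numpy sketch of the first two bullets (the block structure and names here are illustrative, not the actual ggml C code):

```python
# Per 32-value block: pull out the single abs-max value ("outlier"), keep it
# unquantized, and Q4_1-quantize the remaining values; at matmul time,
# dequantize the weight row back to FP32 and take a plain FP32 dot product
# with the unquantized FP32 activations.
import numpy as np

QK = 32  # block size, as in ggml Q4_1

def quantize_block_with_outlier(block):
    assert block.shape == (QK,)
    outlier_index = int(np.argmax(np.abs(block)))
    outlier_value = float(block[outlier_index])
    rest = np.delete(block, outlier_index)        # quantize as if there was no outlier
    lo, hi = rest.min(), rest.max()
    delta = (hi - lo) / 15.0 if hi != lo else 1.0
    quants = np.clip(np.round((block - lo) / delta), 0, 15).astype(np.uint8)
    return lo, delta, outlier_index, outlier_value, quants

def dequantize_block(lo, delta, outlier_index, outlier_value, quants):
    out = lo + delta * quants.astype(np.float32)
    out[outlier_index] = outlier_value            # restore the outlier exactly
    return out

def matmul_row(weight_blocks, activations):
    # Dequantize one weight row block-by-block, then do the dot product in FP32.
    row = np.concatenate([dequantize_block(*b) for b in weight_blocks])
    return float(row @ activations)
```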

It is now slower because of the deoptimization, but perplexity is significantly lower:

rwkv.cpp-169M-Q4_0.bin,                      averages: loss [3.629], perplexity  37.691
rwkv.cpp-169M-Q4_1.bin,                      averages: loss [3.163], perplexity  23.642
rwkv.cpp-169M-EXPERIMENTAL-q4_1-....bin,     averages: loss [2.787], perplexity  16.231
rwkv.cpp-169M-float16.bin,                   averages: loss [2.699], perplexity  14.861
rwkv.cpp-169M.bin,                           averages: loss [2.699], perplexity  14.861

I have not tested it with larger models yet. I'll continue the experiments...

@saharNooby saharNooby changed the title Q4_0 and Q_1 quantization breaks RWKV (probably, due to huge absolute weights) Q4_0 and Q_1 quantization breaks RWKV due to weight/activation outliers Apr 4, 2023
@saharNooby saharNooby changed the title Q4_0 and Q_1 quantization breaks RWKV due to weight/activation outliers Q4_0 and Q4_1 quantization breaks RWKV due to weight/activation outliers Apr 4, 2023
@saharNooby saharNooby pinned this issue Apr 5, 2023
saharNooby (Collaborator Author) commented Apr 5, 2023

Results so far for all models (169M to 7B; no data for 7B FP16 because there is not enough memory):

169M-Q4_0.bin,                  loss [3.629], perplexity  37.691
169M-Q4_1.bin,                  loss [3.163], perplexity  23.642
169M-EXPERIMENTAL-q4_1-....bin, loss [2.787], perplexity  16.231
169M-float16.bin,               loss [2.699], perplexity  14.861

430M-20220808-8066-q4_0.bin,    loss [2.911], perplexity  18.375
430M-20220808-8066-q4_1.bin,    loss [2.631], perplexity  13.885
430M-...-EXPERIMENTAL-q4_1.bin, loss [2.452], perplexity  11.614
430M-20220808-8066-FP16.bin,    loss [2.377], perplexity  10.777

1B5-20220929-ctx4096-Q4_0.bin,  loss [3.079], perplexity  21.745
1B5-20220929-ctx4096-Q4_1.bin,  loss [2.655], perplexity  14.231
1B5-...-EXPERIMENTAL-Q4_1.bin,  loss [2.204], perplexity   9.060
1B5-20220929-ctx4096-FP16.bin,  loss [2.060], perplexity   7.847

3B-20221110-ctx4096-Q4_0.bin,   loss [4.689], perplexity 108.724
3B-20221110-ctx4096-Q4_1.bin,   loss [2.916], perplexity  18.475
3B-...-EXPERIMENTAL-Q4_1.bin,   loss [2.406], perplexity  11.093
3B-20221110-ctx4096-FP16.bin,   loss [2.067], perplexity   7.901

7B-20230109-ctx4096-Q4_0.bin,   loss [6.296], perplexity 542.322
7B-20230109-ctx4096-Q4_1.bin,   loss [3.017], perplexity  20.423
7B-...-EXPERIMENTAL-Q4_1.bin,   loss [2.381], perplexity  10.815

Experimental Q4_1 (as in the previous message -- it stores outliers in a block as-is and does not quantize activations) clearly reduces perplexity compared to vanilla Q4_1, but it is still not worth running 7B INT4 instead of 3B FP16 or even 1.5B FP16.

I have ideas about performance optimization, but performance does not matter while it is still better, quality-wise, to run smaller models.

I'll try to come up with new ideas for changing the quantization format even further.

bennmann commented Apr 5, 2023

hahnyuan/RPTQ4LLM#1: the RPTQ method may be worth keeping an eye on; their repo is the new quantization SOTA.

Hope this helps fuel your implementation too

BlinkDL commented Apr 7, 2023

Can try this for INT4: compute "mx my rx ry" as in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

Basically: rescale all rows & columns of w --> compute INT4 x @ w --> rescale result.

Probably you only need rx & ry, and you can compute them using max(abs(w)).

And probably only need them for att.output.weight (maybe ffn.value.weight too).
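A numpy sketch of this rescaling idea (the square-root split of the per-row/per-column magnitudes is an illustrative choice, not necessarily ChatRWKV's exact code):

```python
# Pull per-row scales ry and per-column scales rx out of w so that the
# normalized matrix is friendlier to INT4, then fold the scales back in
# around the matmul.
import numpy as np

def make_scales(w):
    ry = np.sqrt(np.abs(w).max(axis=1, keepdims=True) + 1e-12)  # one scale per row of w
    rx = np.sqrt(np.abs(w).max(axis=0, keepdims=True) + 1e-12)  # one scale per column of w
    w_normalized = w / (ry * rx)   # this is what would actually be quantized to INT4
    return w_normalized, ry, rx

def scaled_matmul(x, w_normalized, ry, rx):
    # Identity (up to float error): x @ w == ((x * ry) @ w_normalized) * rx
    return ((x * ry.ravel()) @ w_normalized) * rx.ravel()
```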

saharNooby (Collaborator Author) commented Apr 8, 2023

In the end, I decided to create a separate format in ggml called Q4_1_O, and finish experimenting with quantization for now.

Comparison of Q4_1_O with FP16:

Perplexity

Lower is better. See the "Perplexity measuring setup" section below.

Overall, the rule "it's better to use a quantized model X than an FP16 model X-1" holds.

Parameters, M    Q4_1_O    FP16
169              16.700    15.623
430              12.278    11.802
1500              8.962     8.609
3000              8.288     7.736
7000              7.539     OOM


Performance (per-token latency in ms)

Lower is better. Tests were done on a machine with 16 GB RAM and 4 cores/8 threads.

Overall, Q4_1_O is 2x slower than FP16 — at the level of FP32.

Parameters, M    Q4_1_O    FP16
169                  18        13
430                  57        36
1500                232       124
3000                453       248
7000               1141       OOM


Disk/memory consumption

Q4_1_O has the same overhead as Q4_1: a 24-byte block stores 32 quantized values, which gives 0.75 bytes per value. For comparison, FP16 is 2 bytes per value, perfect INT8 is 1 byte, and perfect INT4 is 0.5 bytes.

Reflection and future work

What did work (in order of decreasing importance):

  • doing the dot product in FP32 instead of in the quantized format. That way, we don't botch activations that have outliers. It makes inference significantly slower, matching that of FP32.
  • saving a single outlier value per block as-is, and then quantizing the rest of the values as if there were no outlier. The min and delta fields were reduced from FP32 to FP16 (no quality loss here -- all RWKV models are in FP16 anyway), and the 4 freed bytes are used to store the outlier index and outlier value. That way, there is no increase in file size/memory consumption compared to Q4_1 (see the byte-level sketch after this list).
  • not quantizing head.weight. It does not take much space in bigger models, but when quantized, it significantly increases perplexity. I figured it is not worth quantizing.
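A byte-level illustration of such a block (the field order is an assumption; the point is only that everything fits in the same 24 bytes as a Q4_1 block):

```python
# Illustration of a 24-byte Q4_1_O-style block: FP16 delta, FP16 min,
# uint16 outlier index, FP16 outlier value, and 16 bytes holding 32 4-bit quants.
import struct

delta, vmin = 0.05, -0.4                 # stored as FP16 instead of FP32
outlier_index, outlier_value = 7, 14.0   # the outlier is stored unquantized
quant_nibbles = bytes(16)                # 32 quantized values, two per byte

block = struct.pack("<eeHe", delta, vmin, outlier_index, outlier_value) + quant_nibbles
assert len(block) == 24                  # 24 bytes / 32 values = 0.75 bytes per value
```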

What did not work:

  • scaling columns only -- it slightly improves Q4_0 and Q4_1, and slightly degrades Q4_1_O. I decided it was not worth the additional code complexity. The code is available in the column-scaling branch.

What I did not try (this is left for the future work):

  • splitting the regular and outlier matmuls: do the regular matmul in the quantized format, and the outlier matmul in FP16/FP32, like in LLM.int8().
  • scaling some weight and activation channels, like in SmoothQuant.
  • doing some complicated linear algebra like in GPTQ (not a critique of the paper, rather a critique of my own level of understanding).
  • sorting/reordering columns like in RPTQ.
  • scaling the whole matrix, as BlinkDL suggested.

Moreover, the AVX2 implementation of the Q4_1_O matmul does not look or feel optimal, and it probably can be improved with some asm magic. It's at the limit of my ability to write vectorized code, though.

Perplexity measuring setup
