4-bit Integer quantisation #27

Merged: 38 commits merged into master from gq on Mar 29, 2023

Conversation

ggerganov (Owner) commented on Feb 26, 2023:

close #5 #6 #24

We introduce efficient SIMD 4-bit integer quantisation running on the CPU.

First, some initial results on M1 Pro:

Language Models:

| Model | Params | Size (old) | Time / Token (old) | Size (new) | Time / Token (new) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | 1558 M | 2976 MB | 42 ms | 937 MB | 17 ms |
| GPT-J | 6 B | 11543 MB | 125 ms | 3610 MB | 46 ms |
Here is a short sample run of `GPT-J` inference of 100 tokens:
$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model-q4_0.bin -p "This pull request imlpements integer quantization." -t 8 -n 100

main: seed = 1677426680
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
main: number of tokens in prompt = 15

This pull request imlpements integer quantization. We can see that in a lot of cases, we can get at least a one line of code reduction without changing semantics in any way.

To be more explicit about the trade-offs in our analysis. We can see that it is possible to get about a 70% reduction in execution time, and a 25% reduction in memory usage, while adding only about a 1.5% reduction in code size, and only incresing the number of branches.

This is a trade

main: mem per token = 16041732 bytes
main:     load time =  1187.43 ms
main:   sample time =    14.53 ms
main:  predict time =  5199.36 ms / 45.61 ms per token
main:    total time =  6581.01 ms

Whisper:

| Model | Params | Size (old) | Mem (old) | Size (new) | Mem (new) |
| --- | --- | --- | --- | --- | --- |
| Whisper Tiny | 39 M | 74 MB | 127 MB | 26 MB | 79 MB |
| Whisper Base | 74 M | 141 MB | 215 MB | 48 MB | 123 MB |
| Whisper Small | 244 M | 465 MB | 603 MB | 153 MB | 291 MB |
| Whisper Medium | 769 M | 1462 MB | 1720 MB | 469 MB | 726 MB |
| Whisper Large | 1550 M | 2951 MB | 3336 MB | 939 MB | 1324 MB |
Here is a short `Whisper Medium` run:
$ ./bin/whisper -m models/whisper-medium/ggml-model-q4_0.bin -f ../../whisper.cpp/samples/jfk.wav -t 8

whisper_init_from_file: loading model from 'models/whisper-medium/ggml-model-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing '../../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.040]   And so my fellow Americans, ask not what your country can do for you,
[00:00:08.040 --> 00:00:10.900]   ask what you can do for your country.


whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   221.70 ms
whisper_print_timings:      mel time =     8.65 ms
whisper_print_timings:   sample time =    13.65 ms /    29 runs (    0.47 ms per run)
whisper_print_timings:   encode time =  1994.48 ms /     1 runs ( 1994.48 ms per run)
whisper_print_timings:   decode time =   305.18 ms /    29 runs (   10.52 ms per run)
whisper_print_timings:    total time =  2560.79 ms

Details

Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers plus scaling/offset factors to compress them.

There are different ways to perform the quantisation. In this PR, I investigated the following approaches:

Q4_0

A block of QK floating point numbers x_i is represented by 1 scaling factor (f32) + QK/2 bytes. Each byte stores two 4-bit integers in the range [-7, 7]. The f32 scaling factor is determined as max(|x_i|)/7. The compression ratio achieved with this approach compared to simple f16 storage is:

C = (4 + QK/2)/(2*QK)
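
For example, with QK = 32 a block stores 32 weights in 4 + 16 = 20 bytes (5 bits per weight), i.e. C = 20/64 ≈ 0.31 of the f16 size.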

ggml/src/ggml.c, lines 411 to 439 (commit c686d70):

```c
// scalar
// (excerpt from quantize_row_q4_0: x points to the input floats, pd to the
//  per-block f32 scaling factors, pb to the packed 4-bit output, pp is a
//  QK/2-byte staging buffer, and nb = k/QK is the number of blocks;
//  compare with the analogous quantize_row_q4_1 below)
for (int i = 0; i < nb; i++) {
    float amax = 0.0f; // absolute max

    for (int l = 0; l < QK; l++) {
        const float v = x[i*QK + l];
        amax = MAX(amax, fabsf(v));
    }

    const float d  = amax / ((1 << 3) - 1);
    const float id = d ? 1.0f/d : 0.0f;

    pd[i] = d;

    for (int l = 0; l < QK; l += 2) {
        const float v0 = x[i*QK + l + 0]*id;
        const float v1 = x[i*QK + l + 1]*id;

        const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
        const uint8_t vi1 = ((int8_t) (round(v1))) + 8;

        assert(vi0 >= 0 && vi0 < 16);
        assert(vi1 >= 0 && vi1 < 16);

        pp[l/2] = vi0 | (vi1 << 4);
    }

    memcpy(pb + i*QK/2, pp, sizeof(pp));
}
```
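
For clarity, here is a minimal scalar sketch of the inverse operation, i.e. unpacking a Q4_0 row back to floats under the layout above (nb per-block f32 deltas followed by the packed nibbles). The function name and exact signature are illustrative, not the PR's actual dequantisation routine:

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  // assumed block size, see "Choosing QK" below

// Illustrative sketch: unpack k Q4_0-quantised values back to floats.
// x points to nb f32 deltas followed by nb*QK/2 bytes of packed nibbles.
void dequantize_row_q4_0_sketch(const void * x, float * y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;

    const float   * pd = (const float *)   x;
    const uint8_t * pb = (const uint8_t *) (pd + nb);

    for (int i = 0; i < nb; i++) {
        const float d = pd[i]; // per-block scaling factor

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pb[i*QK/2 + l/2];

            // low nibble first, then high nibble; undo the +8 bias
            y[i*QK + l + 0] = ((int8_t)(vi & 0x0F) - 8) * d;
            y[i*QK + l + 1] = ((int8_t)(vi >>   4) - 8) * d;
        }
    }
}
```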

Q4_1

Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as min(x_i), while the f32 scaling factor is now (max(x_i) - min(x_i))/15. The 4-bit integers are again packed into QK/2 bytes, but this time their range is [0, 15]. The compression ratio compared to simple f16 storage is:

C = (8 + QK/2)/(2*QK)
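
With QK = 32 this gives 8 + 16 = 24 bytes per 32 weights (6 bits per weight), i.e. C = 24/64 = 0.375 of the f16 size.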

ggml/src/ggml.c, lines 443 to 488 (commit c686d70):


```c
// method 4
// blocks of QK elements
// represented with 2 floats (min + delta) and QK/2 8-bit ints (i.e QK 4-bit unsigned integer factors)
void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;

    float   * restrict pm = (float *)   (y);
    float   * restrict pd = (float *)   (pm + nb);
    uint8_t * restrict pb = (uint8_t *) (pd + nb);

    uint8_t pp[QK/2];

    for (int i = 0; i < nb; i++) {
        float min = FLT_MAX;
        float max = -FLT_MAX;

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d  = (max - min) / ((1 << 4) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        pm[i] = min;
        pd[i] = d;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = (x[i*QK + l + 0] - min)*id;
            const float v1 = (x[i*QK + l + 1] - min)*id;

            const uint8_t vi0 = round(v0);
            const uint8_t vi1 = round(v1);

            assert(vi0 >= 0 && vi0 < 16);
            assert(vi1 >= 0 && vi1 < 16);

            pp[l/2] = vi0 | (vi1 << 4);
        }

        memcpy(pb + i*QK/2, pp, sizeof(pp));
    }
}
```

This approach should be more accurate than Q4_0, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
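
To see where the extra computation comes from: a Q4_0 weight reconstructs as d*(q_i - 8), so a block contributes d*sum((q_i - 8)*y_i) to a dot product with a vector y, while a Q4_1 weight reconstructs as m + d*q_i and adds an extra m*sum(y_i) term per block (plus further cross terms when both operands are quantised).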

GQ

I also did a few experiments with general n-bit quantisation. However, I didn't arrive at a technique that can be vectorised efficiently with SIMD, so I decided it is not worth pursuing in the end. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c

Choosing QK

The trade-off when selecting QK: the larger the value, the better the compression ratio, but the worse the accuracy. Additionally, not all QK values can be implemented efficiently - it depends on the available CPU instruction set.

So far, I decided to choose QK = 32 for 128-bit ARM_NEON - it seems this size is more compatible with the available SIMD intrinsics/registers. For AVX2 support, I think QK = 64 might turn out to be a better fit for the 256-bit registers. However, if the performance difference between QK = 32 and QK = 64 is not very large, I might end up using QK = 32 for all architectures - it will make the code significantly simpler.
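(With QK = 32, the packed 4-bit values of one block occupy exactly 16 bytes, i.e. a single 128-bit NEON register; with QK = 64 they would fill a 256-bit AVX2 register.)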

Running

First, convert an existing F16 or F32 ggml model to a 4-bit quantised one:

# quantize GPT-2 model using Q4_0
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-2 model using Q4_1
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize GPT-J model using Q4_0
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-J model using Q4_1
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize Whisper model using Q4_0
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize Whisper model using Q4_1
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

Note: The format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one using the updated Python scripts in this branch.
The Whisper models, on the other hand, are still compatible, so you can quantise them directly.

You can now simply use the generated quantised model files instead of the regular models as usual.

Implementation progress

Q4_0

  • Scalar
  • ARM_NEON
  • AVX2
  • WASM SIMD

Q4_1

  • Scalar
  • ARM_NEON
  • AVX2
  • WASM SIMD

ocordeiro (Contributor) commented:

How do I run the GPT-J-6B model?
I'm getting the following error:

gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]

Steps to reproduce:

#  Get this branch
git checkout gq && git pull

# Build GPT-J and GPT-J-quantize
make gpt-j && make gpt-j-quantize

# Download GPT-J-6B model
./examples/gpt-j/download-ggml-model.sh 6B

# Quantize GPT-J-6B model 
./bin/gpt-j-quantize ../models/gpt-j-6B/ggml-model.bin ../gpt-j-ggml-model-q4_0.bin 2

#  Run GPT-J-6B model
./build/bin/gpt-j -m ./gpt-j-ggml-model-q4_0.bin -p "This is an example"

  • Environment: M1 Air - macOS 13.2

ggerganov (Owner, Author) commented:

@ocordeiro

Due to the quantization changes, I had to transpose a few of the tensors in the model.
So this makes the old ggml files incompatible with the quantization branch.

In order to make it work, you have to convert the original H5 data using the convert-h5-to-ggml.py from this branch.
To do that, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B
And run the command:

python3 examples/gpt-j/convert-h5-to-ggml.py ./models/gpt-j-6B 0

After you convert the Python model to a ggml model, you can then use the gpt-j-quantize command to quantize it.

The process is a bit tedious now, but when the implementation is ready, I will upload the quantized models to Hugging Face and it will be easier.

ocordeiro (Contributor) commented:

Great. Thank you very much for the explanation. I will do this

ocordeiro (Contributor) commented:

It worked and it's impressive.
Here are the results on my M1 Air 8GB:

main: mem per token = 15976132 bytes
main:     load time =  2016.22 ms
main:   sample time =    32.71 ms
main:  predict time = 18798.93 ms / 92.61 ms per token
main:    total time = 21609.82 ms

tmzt commented on Mar 5, 2023:

@ocordeiro or anyone else,

can you upload the ggml weights to HF, bittorrent, etc.?

ocordeiro (Contributor) commented:

@tmzt it's here until @ggerganov launches the official version:
https://huggingface.co/ocordeiro/ggml-gpt-j-6b-q4_0

Const-me commented:

I'm not sure I fully understood your spec, but here's an AVX2 decompressor for these blocks:
https://gist.github.com/Const-me/a0529a8c9885d371138a1c50e0622040
Tested very little, haven’t tested performance at all, but still, it seems to work for that one test which I have implemented.
Feel free to copy-paste.

ggerganov (Owner, Author) commented:

@Const-me
Awesome! Thank you for this.
During inference, the most crucial parts that have to run fast are:

For the first one, I have this version, but I don't know if it is optimal yet:

https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L2038-L2113

For the second one, I have a version for QK == 64, but I need one for QK == 32:

https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L1816-L1870

Any advice on the implementation and making it more efficient will be appreciated!

Const-me commented:

@ggerganov Here's the code.
https://gist.github.com/Const-me/65ff46c31553493d13fcd6646e162494

The implementation of quantize_row_q4_0 is in compressRow40 function in that source file.

The implementation of ggml_vec_dot_q4_0 is in the dotProductCompressed40 function in that source file.

Again, tested very little, so there could be bugs, and I have not measured performance.

A couple of general notes.

About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So the Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.
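
For illustration, an interleaved Q4_0 block along these lines could look like the following sketch (assuming QK = 32; the struct name is hypothetical, not taken from this PR):

```c
#include <stdint.h>

#define QK 32

// Interleaved layout: scale and packed values stored together,
// so one block is 4 + QK/2 = 20 bytes and can be loaded contiguously.
typedef struct {
    float   d;          // f32 scaling factor
    uint8_t qs[QK / 2]; // QK 4-bit values, two per byte
} block_q4_0;
```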

Another thing: I don't understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16; upcasting/downcasting vectors is one fast instruction).

Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor. Take a look at how I did that for the hybrid model of Whisper (currently disabled with a macro, but it should work):
https://github.com/Const-me/Whisper/blob/master/Whisper/CPU/mulMatImpl.h
And the rest of the mulMat*.* files in that folder.
That implementation is very specialized, only supports FP32*FP16, and I only tested it for the decode step of the algorithm. But still, it's substantially faster than what's in GGML.

Also, see this answer on Stack Overflow: https://stackoverflow.com/a/75567894/126995 I wrote that answer for a matrix*vector product, but it is possible to use a similar memory layout for matrix*matrix as well.

ggerganov (Owner, Author) commented on Mar 11, 2023:

@Const-me

Thank you so much - you are the best!

I just added AVX2 support to llama.cpp thanks to your code snippets: ggerganov/llama.cpp@f1eaff4

> About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So the Q4_0 block gonna take 20 bytes, first 4 bytes is the scaling, another 16 bytes is the values.

Already did that today in the llama.cpp repo - it was necessary for consolidating the larger LLaMA models anyway.
Will need to migrate the changes here at some point.

> Another thing, I don’t understand why are you multiplying two compressed rows? I would expect only the model to be compressed (because using tons of memory, and the compression can be completed offline), but all intermediate tensors be uncompressed FP32 (or at least FP16, upcasting/downcasting vectors is one fast instruction).

The idea is to reduce memory bandwidth. I think the computation becomes memory-bound on many cores. So it is more important to reduce data size rather than optimizing the calculations. I could be wrong ..

> Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor.

I know! I started doing this with very little knowledge about GEMM and I am sure there is a lot of room for improvements.
Thank you again for all your help.

Edit: fixed wrong quotes

Const-me commented:

@ggerganov About the compression for intermediate tensors: I've made another function if you want to try, dotProduct_q40_f16. I'm not sure what you'll find, but it's possible FP16 intermediates might be slightly faster than Q4 compressed.

That block compression is slower than downcasting floats to FP16. And processors often have many megabytes of L3 cache; for example, my processor has 16MB. The intermediate tensors which were just computed from something else might still be in that cache.

Narsil added a commit to huggingface/safetensors that referenced this pull request Mar 17, 2023
meakbiyik commented on Mar 19, 2023:

Just to cross-reference: 4-bit quantization does not give the expected performance improvement on non-Apple ARM processors. In fact, there is a drastic reduction in performance: ggerganov/whisper.cpp#540 (comment)

mallorbc commented on Mar 26, 2023:

Is there a reason why llama.cpp supports 4-bit quantization on x86 processors, but GPT-J does not work with 4-bit on x86?

Edit:
Looking at some of the commits and the edit history of the main comment, it seems that x86 may be supported now and the comment just doesn't reflect that. I see commits relating to x86 from 3 weeks ago, while the main comment was last updated a month ago. I will try to get 4-bit working on x86.

iamfaith commented:

A Dolly-like GPT-J model quantized successfully but fails to load:

gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]

ggerganov merged commit acd4aee into master on Mar 29, 2023
ggerganov deleted the gq branch on March 29, 2023 at 19:21
ahoho commented on Apr 2, 2023:

I made a note elsewhere, but I'm finding q4_1 to be worse than q4_0 in at least one instance.

ggerganov (Owner, Author) commented:

@ahoho
There might be a bug in the ARM_NEON Q4_1 implementation; I've received additional reports indicating that. I still haven't had time to look into it.
