Q4_0 scale selection using RMSE #835

Draft: wants to merge 3 commits into master
Conversation

sw (Collaborator) commented Apr 7, 2023

This combines some ideas from PR #729 and issue #397 to select a scale factor for Q4_0 with low RMS error.

In order to KISS, I simply made a table of 8 hard-coded values, after analysing the optimum values in steps of 0.1.
The result of that analysis is documented in examples/quantize/scale.py and reproduced here:
[plot: optimum scale values from the analysis in examples/quantize/scale.py]
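For illustration, here is a minimal sketch of the table-driven selection (the eight candidate divisors below are placeholders, not the actual constants from this PR): for each block, every candidate scale is tried and the one with the lowest squared reconstruction error wins.

#include <algorithm>
#include <cmath>
#include <limits>

// Sketch only: candidate divisors are hypothetical; QK = 32 as in ggml's Q4_0 blocks.
static const float k_candidates[8] = { 7.0f, 7.3f, 7.6f, 7.9f, 8.2f, 8.5f, 8.8f, 9.1f };

static float pick_scale_q4_0(const float * x, int qk) {
    float amax = 0.0f;
    for (int i = 0; i < qk; i++) amax = std::max(amax, std::fabs(x[i]));
    float best_d   = amax / 7.0f; // master's choice as fallback
    float best_err = std::numeric_limits<float>::max();
    for (float c : k_candidates) {
        const float d  = amax / c;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        float err = 0.0f;
        for (int i = 0; i < qk; i++) {
            int q = (int) std::round(x[i] * id);
            q = std::min(std::max(q, -8), 7); // clamp to the signed 4-bit range
            const float e = x[i] - d * (float) q;
            err += e * e;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}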

Error statistics (#728):

q4_0 : rmse 0.00221840, maxerr 0.14257812, 95pct<0.0040, median<0.0018 (master)
q4_0 : rmse 0.00196398, maxerr 0.18200684, 95pct<0.0036, median<0.0016 (#729)
q4_0 : rmse 0.00185915, maxerr 0.14257812, 95pct<0.0034, median<0.0014 (this PR)

quantize.cpp run time on 7B:

80s (master cc9cee8)
135s (this PR, AVX2)
385s (this PR, scalar)

I introduce a minor version number at the very end of the file.
This allows us to nudge the user to re-create their files without breaking anything.
I had to modify the read loop, as it used to try to read past EOF.
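As a rough sketch of the mechanism (illustrative only; fin, fout and file_size stand for the stream and size variables already used in llama.cpp's loader, and the exact layout is whatever this PR defines):

// Write side: append a 4-byte minor version after the last tensor.
const int32_t k_minor_version = 1;
fout.write((const char *) &k_minor_version, sizeof(k_minor_version));

// Read side: the tensor loop stops while enough bytes remain for the next
// header, then the trailing version (0 if absent) decides whether to nudge.
int32_t minor = 0;
if (size_t(fin.tellg()) + sizeof(minor) <= file_size) {
    fin.read((char *) &minor, sizeof(minor));
}
if (minor < 1) {
    fprintf(stderr, "%s: model file predates minor version 1, consider re-quantizing\n", __func__);
}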

I removed the test of ggml_quantize_q4_0 which I originally wrote and was quite minimal.
This is admittedly lazy, but I couldn't think of a good test right away.
Maybe we just need to provide a model file that's not too big for the CI machines and check for equivalence after quantization.

The alignment macros are a bit of a hack. I don't have Windows to test on here and don't want to keep hitting the CI with trial-and-error.
Is there a clean cross-platform way to do it? And come to think of alignment: why are the input floats not aligned? (edit: probably because llama_model_quantize_internal doesn't use mmap; let me see if we can force the alignment of the buffers)

Currently running perplexity, but it's taking 12 hours here so I may not wait for that.

This does not obsolete #729, as my PR only changes the method for the model generation.
We might still use @unbounded's work and set the scale to -8 instead of +7 for the other uses of the quantization function.
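For context, the #729 idea in a few lines (a sketch, not the PR's exact code): keep the sign of the largest-magnitude element and divide by -8, so that element lands exactly on the -8 code point of the signed 4-bit range [-8, 7].

#include <algorithm>
#include <cmath>
#include <cstdint>

static void quantize_block_neg8(const float * x, int8_t * q, int qk) {
    float amax = 0.0f; // largest magnitude in the block
    float vmax = 0.0f; // that element, sign included
    for (int i = 0; i < qk; i++) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); vmax = x[i]; }
    }
    const float d  = vmax / -8.0f;          // max-magnitude element maps to -8 exactly
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < qk; i++) {
        int qi = (int) std::round(x[i] * id);
        // only the +8 side can clip: an element at exactly -vmax rounds to +8
        q[i] = (int8_t) std::min(std::max(qi, -8), 7);
    }
}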

ggerganov (Owner) commented:
Very interesting analysis and data 😄
Curious to see the new perplexity values

Btw, I've been thinking a little bit about how to determine the scale factor to minimize the RMS and I am fairly certain that there is a straightforward way to compute the optimum value without search. I don't have the formula yet - just a strong intuition lol

sw (Collaborator, Author) commented Apr 7, 2023

I am fairly certain that there is a straightforward way to compute the optimum value without search.

I'd love to see that, but it doesn't seem evident to me: while the error function appears to be continuous and piecewise differentiable (at the points where the rounding flips, abs(error) stays the same), that alone doesn't suggest a closed form.

Here's the plot of the very first block in the 7B model (first input value = 9.888411e-05):
[plot: sum of squared errors vs. scale for the first block of the 7B model]

(This is just the sum of squared errors, I didn't bother with the square root and scaling by QK.)

unbounded (Collaborator) commented:
So, as mentioned in #397 (comment), I believe I have an RMSE-optimal but very slow implementation of the scaling search...

And your implementation gets extremely close:

You posted:

q4_0 : rmse 0.00185915, maxerr 0.14257812, 95pct<0.0034, median<0.0014 (this PR)

"optimal":

q4_0 : rmse 0.00184913, maxerr 0.14257812, 95pct<0.0034, median<0.0014

That's probably about as good as we can hope for.

Full output for verification - if you get a lower RMSE for any layer I have a bug :)

quantize-stats output
note: source model is f16
testing 226 layers with max size 131072000
q4_0::layers.0.attention.wk.weight                : rmse 0.00292301, maxerr 0.07012939, 95pct<0.0060, median<0.0018
q4_0::layers.0.attention.wo.weight                : rmse 0.00100800, maxerr 0.03466797, 95pct<0.0020, median<0.0008
q4_0::layers.0.attention.wq.weight                : rmse 0.00299672, maxerr 0.04935372, 95pct<0.0062, median<0.0016
q4_0::layers.0.attention.wv.weight                : rmse 0.00110658, maxerr 0.01175963, 95pct<0.0022, median<0.0008
q4_0::layers.0.feed_forward.w1.weight             : rmse 0.00137271, maxerr 0.06394231, 95pct<0.0026, median<0.0012
q4_0::layers.0.feed_forward.w2.weight             : rmse 0.00167287, maxerr 0.05469465, 95pct<0.0030, median<0.0014
q4_0::layers.0.feed_forward.w3.weight             : rmse 0.00132967, maxerr 0.01823677, 95pct<0.0024, median<0.0012
q4_0::layers.1.attention.wk.weight                : rmse 0.00282859, maxerr 0.04061890, 95pct<0.0060, median<0.0016
q4_0::layers.1.attention.wo.weight                : rmse 0.00097638, maxerr 0.03619385, 95pct<0.0020, median<0.0008
q4_0::layers.1.attention.wq.weight                : rmse 0.00276326, maxerr 0.03581724, 95pct<0.0058, median<0.0016
q4_0::layers.1.attention.wv.weight                : rmse 0.00094906, maxerr 0.00811867, 95pct<0.0020, median<0.0008
q4_0::layers.1.feed_forward.w1.weight             : rmse 0.00174292, maxerr 0.03725962, 95pct<0.0032, median<0.0014
q4_0::layers.1.feed_forward.w2.weight             : rmse 0.00171300, maxerr 0.05225383, 95pct<0.0032, median<0.0014
q4_0::layers.1.feed_forward.w3.weight             : rmse 0.00165360, maxerr 0.02176375, 95pct<0.0030, median<0.0014
q4_0::layers.10.attention.wk.weight               : rmse 0.00225472, maxerr 0.02979095, 95pct<0.0044, median<0.0016
q4_0::layers.10.attention.wo.weight               : rmse 0.00143767, maxerr 0.04605806, 95pct<0.0026, median<0.0012
q4_0::layers.10.attention.wq.weight               : rmse 0.00222988, maxerr 0.03421440, 95pct<0.0044, median<0.0016
q4_0::layers.10.attention.wv.weight               : rmse 0.00144258, maxerr 0.01459024, 95pct<0.0028, median<0.0012
q4_0::layers.10.feed_forward.w1.weight            : rmse 0.00183416, maxerr 0.02703372, 95pct<0.0034, median<0.0014
q4_0::layers.10.feed_forward.w2.weight            : rmse 0.00174484, maxerr 0.04180530, 95pct<0.0032, median<0.0014
q4_0::layers.10.feed_forward.w3.weight            : rmse 0.00177285, maxerr 0.02142334, 95pct<0.0032, median<0.0014
q4_0::layers.11.attention.wk.weight               : rmse 0.00233274, maxerr 0.02713823, 95pct<0.0046, median<0.0018
q4_0::layers.11.attention.wo.weight               : rmse 0.00150656, maxerr 0.03012497, 95pct<0.0028, median<0.0012
q4_0::layers.11.attention.wq.weight               : rmse 0.00229496, maxerr 0.04412842, 95pct<0.0044, median<0.0018
q4_0::layers.11.attention.wv.weight               : rmse 0.00151707, maxerr 0.02018456, 95pct<0.0028, median<0.0012
q4_0::layers.11.feed_forward.w1.weight            : rmse 0.00182944, maxerr 0.02489303, 95pct<0.0034, median<0.0014
q4_0::layers.11.feed_forward.w2.weight            : rmse 0.00175960, maxerr 0.05431067, 95pct<0.0032, median<0.0014
q4_0::layers.11.feed_forward.w3.weight            : rmse 0.00178396, maxerr 0.02583313, 95pct<0.0032, median<0.0014
q4_0::layers.12.attention.wk.weight               : rmse 0.00222511, maxerr 0.02594195, 95pct<0.0044, median<0.0016
q4_0::layers.12.attention.wo.weight               : rmse 0.00147925, maxerr 0.02380922, 95pct<0.0028, median<0.0012
q4_0::layers.12.attention.wq.weight               : rmse 0.00218941, maxerr 0.03447523, 95pct<0.0042, median<0.0016
q4_0::layers.12.attention.wv.weight               : rmse 0.00146629, maxerr 0.01153804, 95pct<0.0028, median<0.0012
q4_0::layers.12.feed_forward.w1.weight            : rmse 0.00183979, maxerr 0.03383104, 95pct<0.0034, median<0.0014
q4_0::layers.12.feed_forward.w2.weight            : rmse 0.00176264, maxerr 0.05683154, 95pct<0.0032, median<0.0014
q4_0::layers.12.feed_forward.w3.weight            : rmse 0.00179476, maxerr 0.01740211, 95pct<0.0034, median<0.0014
q4_0::layers.13.attention.wk.weight               : rmse 0.00217331, maxerr 0.02676816, 95pct<0.0044, median<0.0016
q4_0::layers.13.attention.wo.weight               : rmse 0.00153305, maxerr 0.04341370, 95pct<0.0028, median<0.0012
q4_0::layers.13.attention.wq.weight               : rmse 0.00213820, maxerr 0.03543091, 95pct<0.0042, median<0.0016
q4_0::layers.13.attention.wv.weight               : rmse 0.00153372, maxerr 0.01126552, 95pct<0.0028, median<0.0012
q4_0::layers.13.feed_forward.w1.weight            : rmse 0.00183155, maxerr 0.02292150, 95pct<0.0034, median<0.0014
q4_0::layers.13.feed_forward.w2.weight            : rmse 0.00177651, maxerr 0.03530073, 95pct<0.0032, median<0.0014
q4_0::layers.13.feed_forward.w3.weight            : rmse 0.00181181, maxerr 0.01798833, 95pct<0.0034, median<0.0016
q4_0::layers.14.attention.wk.weight               : rmse 0.00217185, maxerr 0.02497105, 95pct<0.0042, median<0.0016
q4_0::layers.14.attention.wo.weight               : rmse 0.00153627, maxerr 0.06212232, 95pct<0.0028, median<0.0012
q4_0::layers.14.attention.wq.weight               : rmse 0.00215347, maxerr 0.03887939, 95pct<0.0042, median<0.0016
q4_0::layers.14.attention.wv.weight               : rmse 0.00154264, maxerr 0.01345214, 95pct<0.0028, median<0.0012
q4_0::layers.14.feed_forward.w1.weight            : rmse 0.00182898, maxerr 0.02304077, 95pct<0.0034, median<0.0014
q4_0::layers.14.feed_forward.w2.weight            : rmse 0.00178511, maxerr 0.05890521, 95pct<0.0032, median<0.0014
q4_0::layers.14.feed_forward.w3.weight            : rmse 0.00181856, maxerr 0.02665675, 95pct<0.0034, median<0.0016
q4_0::layers.15.attention.wk.weight               : rmse 0.00219269, maxerr 0.02394998, 95pct<0.0044, median<0.0016
q4_0::layers.15.attention.wo.weight               : rmse 0.00154000, maxerr 0.02813050, 95pct<0.0028, median<0.0012
q4_0::layers.15.attention.wq.weight               : rmse 0.00215290, maxerr 0.03628540, 95pct<0.0042, median<0.0016
q4_0::layers.15.attention.wv.weight               : rmse 0.00154661, maxerr 0.01409675, 95pct<0.0028, median<0.0012
q4_0::layers.15.feed_forward.w1.weight            : rmse 0.00182940, maxerr 0.02419187, 95pct<0.0034, median<0.0014
q4_0::layers.15.feed_forward.w2.weight            : rmse 0.00178558, maxerr 0.05858561, 95pct<0.0032, median<0.0014
q4_0::layers.15.feed_forward.w3.weight            : rmse 0.00181912, maxerr 0.02241516, 95pct<0.0034, median<0.0016
q4_0::layers.16.attention.wk.weight               : rmse 0.00217754, maxerr 0.02458954, 95pct<0.0042, median<0.0016
q4_0::layers.16.attention.wo.weight               : rmse 0.00163187, maxerr 0.05107081, 95pct<0.0030, median<0.0014
q4_0::layers.16.attention.wq.weight               : rmse 0.00212385, maxerr 0.04119629, 95pct<0.0040, median<0.0016
q4_0::layers.16.attention.wv.weight               : rmse 0.00164553, maxerr 0.01337417, 95pct<0.0030, median<0.0014
q4_0::layers.16.feed_forward.w1.weight            : rmse 0.00184241, maxerr 0.02344798, 95pct<0.0034, median<0.0016
q4_0::layers.16.feed_forward.w2.weight            : rmse 0.00178439, maxerr 0.05552104, 95pct<0.0032, median<0.0014
q4_0::layers.16.feed_forward.w3.weight            : rmse 0.00181314, maxerr 0.02277143, 95pct<0.0034, median<0.0016
q4_0::layers.17.attention.wk.weight               : rmse 0.00212176, maxerr 0.02422421, 95pct<0.0042, median<0.0016
q4_0::layers.17.attention.wo.weight               : rmse 0.00165387, maxerr 0.03002930, 95pct<0.0030, median<0.0014
q4_0::layers.17.attention.wq.weight               : rmse 0.00207895, maxerr 0.04604350, 95pct<0.0040, median<0.0016
q4_0::layers.17.attention.wv.weight               : rmse 0.00165649, maxerr 0.01419830, 95pct<0.0030, median<0.0014
q4_0::layers.17.feed_forward.w1.weight            : rmse 0.00184599, maxerr 0.02392328, 95pct<0.0034, median<0.0016
q4_0::layers.17.feed_forward.w2.weight            : rmse 0.00179142, maxerr 0.04622682, 95pct<0.0032, median<0.0014
q4_0::layers.17.feed_forward.w3.weight            : rmse 0.00181806, maxerr 0.02359099, 95pct<0.0034, median<0.0016
q4_0::layers.18.attention.wk.weight               : rmse 0.00208260, maxerr 0.02502441, 95pct<0.0040, median<0.0016
q4_0::layers.18.attention.wo.weight               : rmse 0.00164773, maxerr 0.03822631, 95pct<0.0030, median<0.0014
q4_0::layers.18.attention.wq.weight               : rmse 0.00205646, maxerr 0.04051746, 95pct<0.0040, median<0.0016
q4_0::layers.18.attention.wv.weight               : rmse 0.00165172, maxerr 0.01335841, 95pct<0.0030, median<0.0014
q4_0::layers.18.feed_forward.w1.weight            : rmse 0.00186100, maxerr 0.03084695, 95pct<0.0034, median<0.0016
q4_0::layers.18.feed_forward.w2.weight            : rmse 0.00178702, maxerr 0.06258377, 95pct<0.0032, median<0.0014
q4_0::layers.18.feed_forward.w3.weight            : rmse 0.00181154, maxerr 0.01813507, 95pct<0.0034, median<0.0016
q4_0::layers.19.attention.wk.weight               : rmse 0.00204409, maxerr 0.02549587, 95pct<0.0040, median<0.0016
q4_0::layers.19.attention.wo.weight               : rmse 0.00171742, maxerr 0.04106662, 95pct<0.0032, median<0.0014
q4_0::layers.19.attention.wq.weight               : rmse 0.00202074, maxerr 0.04685394, 95pct<0.0040, median<0.0016
q4_0::layers.19.attention.wv.weight               : rmse 0.00173205, maxerr 0.01311102, 95pct<0.0032, median<0.0014
q4_0::layers.19.feed_forward.w1.weight            : rmse 0.00187151, maxerr 0.03121948, 95pct<0.0034, median<0.0016
q4_0::layers.19.feed_forward.w2.weight            : rmse 0.00178935, maxerr 0.04564381, 95pct<0.0032, median<0.0014
q4_0::layers.19.feed_forward.w3.weight            : rmse 0.00180759, maxerr 0.02138457, 95pct<0.0034, median<0.0016
q4_0::layers.2.attention.wk.weight                : rmse 0.00310555, maxerr 0.03675859, 95pct<0.0064, median<0.0020
q4_0::layers.2.attention.wo.weight                : rmse 0.00115159, maxerr 0.04546779, 95pct<0.0022, median<0.0010
q4_0::layers.2.attention.wq.weight                : rmse 0.00298841, maxerr 0.03752440, 95pct<0.0060, median<0.0020
q4_0::layers.2.attention.wv.weight                : rmse 0.00112951, maxerr 0.00926531, 95pct<0.0022, median<0.0010
q4_0::layers.2.feed_forward.w1.weight             : rmse 0.00183671, maxerr 0.05353853, 95pct<0.0034, median<0.0016
q4_0::layers.2.feed_forward.w2.weight             : rmse 0.00170433, maxerr 0.09649658, 95pct<0.0032, median<0.0014
q4_0::layers.2.feed_forward.w3.weight             : rmse 0.00167454, maxerr 0.03201294, 95pct<0.0030, median<0.0014
q4_0::layers.20.attention.wk.weight               : rmse 0.00207524, maxerr 0.02473852, 95pct<0.0040, median<0.0016
q4_0::layers.20.attention.wo.weight               : rmse 0.00176106, maxerr 0.02588722, 95pct<0.0032, median<0.0014
q4_0::layers.20.attention.wq.weight               : rmse 0.00204837, maxerr 0.05462646, 95pct<0.0040, median<0.0016
q4_0::layers.20.attention.wv.weight               : rmse 0.00178526, maxerr 0.01499712, 95pct<0.0034, median<0.0014
q4_0::layers.20.feed_forward.w1.weight            : rmse 0.00188099, maxerr 0.02917725, 95pct<0.0034, median<0.0016
q4_0::layers.20.feed_forward.w2.weight            : rmse 0.00179125, maxerr 0.06890869, 95pct<0.0032, median<0.0014
q4_0::layers.20.feed_forward.w3.weight            : rmse 0.00180859, maxerr 0.01596069, 95pct<0.0034, median<0.0016
q4_0::layers.21.attention.wk.weight               : rmse 0.00200054, maxerr 0.02908368, 95pct<0.0040, median<0.0014
q4_0::layers.21.attention.wo.weight               : rmse 0.00177119, maxerr 0.05007464, 95pct<0.0032, median<0.0014
q4_0::layers.21.attention.wq.weight               : rmse 0.00198177, maxerr 0.05149466, 95pct<0.0038, median<0.0014
q4_0::layers.21.attention.wv.weight               : rmse 0.00179837, maxerr 0.01333202, 95pct<0.0034, median<0.0014
q4_0::layers.21.feed_forward.w1.weight            : rmse 0.00189033, maxerr 0.03076535, 95pct<0.0034, median<0.0016
q4_0::layers.21.feed_forward.w2.weight            : rmse 0.00178966, maxerr 0.03637502, 95pct<0.0032, median<0.0014
q4_0::layers.21.feed_forward.w3.weight            : rmse 0.00180614, maxerr 0.02140096, 95pct<0.0034, median<0.0016
q4_0::layers.22.attention.wk.weight               : rmse 0.00203025, maxerr 0.03339660, 95pct<0.0040, median<0.0016
q4_0::layers.22.attention.wo.weight               : rmse 0.00177702, maxerr 0.07931513, 95pct<0.0032, median<0.0014
q4_0::layers.22.attention.wq.weight               : rmse 0.00201616, maxerr 0.04454328, 95pct<0.0038, median<0.0016
q4_0::layers.22.attention.wv.weight               : rmse 0.00178748, maxerr 0.01423188, 95pct<0.0034, median<0.0014
q4_0::layers.22.feed_forward.w1.weight            : rmse 0.00189302, maxerr 0.02517700, 95pct<0.0034, median<0.0016
q4_0::layers.22.feed_forward.w2.weight            : rmse 0.00179775, maxerr 0.04281616, 95pct<0.0034, median<0.0016
q4_0::layers.22.feed_forward.w3.weight            : rmse 0.00181394, maxerr 0.03024019, 95pct<0.0034, median<0.0016
q4_0::layers.23.attention.wk.weight               : rmse 0.00195991, maxerr 0.02972737, 95pct<0.0038, median<0.0014
q4_0::layers.23.attention.wo.weight               : rmse 0.00182629, maxerr 0.04887556, 95pct<0.0034, median<0.0016
q4_0::layers.23.attention.wq.weight               : rmse 0.00195417, maxerr 0.04232788, 95pct<0.0038, median<0.0014
q4_0::layers.23.attention.wv.weight               : rmse 0.00185857, maxerr 0.01577342, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w1.weight            : rmse 0.00189658, maxerr 0.03308105, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w2.weight            : rmse 0.00180367, maxerr 0.04928589, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w3.weight            : rmse 0.00181737, maxerr 0.02468872, 95pct<0.0034, median<0.0016
q4_0::layers.24.attention.wk.weight               : rmse 0.00196715, maxerr 0.02162942, 95pct<0.0038, median<0.0014
q4_0::layers.24.attention.wo.weight               : rmse 0.00184930, maxerr 0.03620195, 95pct<0.0034, median<0.0016
q4_0::layers.24.attention.wq.weight               : rmse 0.00195618, maxerr 0.04705903, 95pct<0.0038, median<0.0014
q4_0::layers.24.attention.wv.weight               : rmse 0.00188009, maxerr 0.01770980, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w1.weight            : rmse 0.00189906, maxerr 0.02117351, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w2.weight            : rmse 0.00181186, maxerr 0.05899048, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w3.weight            : rmse 0.00182756, maxerr 0.02068704, 95pct<0.0034, median<0.0016
q4_0::layers.25.attention.wk.weight               : rmse 0.00202900, maxerr 0.02362627, 95pct<0.0038, median<0.0016
q4_0::layers.25.attention.wo.weight               : rmse 0.00186576, maxerr 0.06477863, 95pct<0.0034, median<0.0016
q4_0::layers.25.attention.wq.weight               : rmse 0.00200834, maxerr 0.03808594, 95pct<0.0038, median<0.0016
q4_0::layers.25.attention.wv.weight               : rmse 0.00188682, maxerr 0.01595676, 95pct<0.0034, median<0.0016
q4_0::layers.25.feed_forward.w1.weight            : rmse 0.00190352, maxerr 0.02079988, 95pct<0.0036, median<0.0016
q4_0::layers.25.feed_forward.w2.weight            : rmse 0.00181823, maxerr 0.03286743, 95pct<0.0034, median<0.0016
q4_0::layers.25.feed_forward.w3.weight            : rmse 0.00183412, maxerr 0.01735053, 95pct<0.0034, median<0.0016
q4_0::layers.26.attention.wk.weight               : rmse 0.00199544, maxerr 0.02733952, 95pct<0.0038, median<0.0016
q4_0::layers.26.attention.wo.weight               : rmse 0.00192010, maxerr 0.02563220, 95pct<0.0036, median<0.0016
q4_0::layers.26.attention.wq.weight               : rmse 0.00197604, maxerr 0.03735352, 95pct<0.0038, median<0.0016
q4_0::layers.26.attention.wv.weight               : rmse 0.00194300, maxerr 0.01509885, 95pct<0.0036, median<0.0016
q4_0::layers.26.feed_forward.w1.weight            : rmse 0.00190232, maxerr 0.03396144, 95pct<0.0036, median<0.0016
q4_0::layers.26.feed_forward.w2.weight            : rmse 0.00183005, maxerr 0.04354858, 95pct<0.0034, median<0.0016
q4_0::layers.26.feed_forward.w3.weight            : rmse 0.00184771, maxerr 0.03059387, 95pct<0.0034, median<0.0016
q4_0::layers.27.attention.wk.weight               : rmse 0.00198943, maxerr 0.02681477, 95pct<0.0038, median<0.0016
q4_0::layers.27.attention.wo.weight               : rmse 0.00196662, maxerr 0.05517289, 95pct<0.0036, median<0.0016
q4_0::layers.27.attention.wq.weight               : rmse 0.00198182, maxerr 0.03899045, 95pct<0.0038, median<0.0016
q4_0::layers.27.attention.wv.weight               : rmse 0.00197615, maxerr 0.01669417, 95pct<0.0036, median<0.0016
q4_0::layers.27.feed_forward.w1.weight            : rmse 0.00190184, maxerr 0.02731323, 95pct<0.0036, median<0.0016
q4_0::layers.27.feed_forward.w2.weight            : rmse 0.00184141, maxerr 0.04620361, 95pct<0.0034, median<0.0016
q4_0::layers.27.feed_forward.w3.weight            : rmse 0.00185554, maxerr 0.04153442, 95pct<0.0034, median<0.0016
q4_0::layers.28.attention.wk.weight               : rmse 0.00194652, maxerr 0.02850908, 95pct<0.0038, median<0.0014
q4_0::layers.28.attention.wo.weight               : rmse 0.00198808, maxerr 0.03118306, 95pct<0.0036, median<0.0016
q4_0::layers.28.attention.wq.weight               : rmse 0.00194074, maxerr 0.04092407, 95pct<0.0038, median<0.0014
q4_0::layers.28.attention.wv.weight               : rmse 0.00198781, maxerr 0.01527815, 95pct<0.0036, median<0.0016
q4_0::layers.28.feed_forward.w1.weight            : rmse 0.00189349, maxerr 0.03170776, 95pct<0.0034, median<0.0016
q4_0::layers.28.feed_forward.w2.weight            : rmse 0.00185116, maxerr 0.05222714, 95pct<0.0034, median<0.0016
q4_0::layers.28.feed_forward.w3.weight            : rmse 0.00186482, maxerr 0.03631857, 95pct<0.0034, median<0.0016
q4_0::layers.29.attention.wk.weight               : rmse 0.00193214, maxerr 0.02202798, 95pct<0.0038, median<0.0014
q4_0::layers.29.attention.wo.weight               : rmse 0.00204716, maxerr 0.03959709, 95pct<0.0038, median<0.0016
q4_0::layers.29.attention.wq.weight               : rmse 0.00192283, maxerr 0.04244995, 95pct<0.0036, median<0.0014
q4_0::layers.29.attention.wv.weight               : rmse 0.00204643, maxerr 0.01519237, 95pct<0.0038, median<0.0016
q4_0::layers.29.feed_forward.w1.weight            : rmse 0.00189820, maxerr 0.03314209, 95pct<0.0036, median<0.0016
q4_0::layers.29.feed_forward.w2.weight            : rmse 0.00186130, maxerr 0.09802246, 95pct<0.0034, median<0.0016
q4_0::layers.29.feed_forward.w3.weight            : rmse 0.00187583, maxerr 0.02655141, 95pct<0.0034, median<0.0016
q4_0::layers.3.attention.wk.weight                : rmse 0.00257589, maxerr 0.03777078, 95pct<0.0052, median<0.0018
q4_0::layers.3.attention.wo.weight                : rmse 0.00133662, maxerr 0.04435936, 95pct<0.0024, median<0.0012
q4_0::layers.3.attention.wq.weight                : rmse 0.00246442, maxerr 0.04611765, 95pct<0.0048, median<0.0018
q4_0::layers.3.attention.wv.weight                : rmse 0.00133929, maxerr 0.01030663, 95pct<0.0024, median<0.0012
q4_0::layers.3.feed_forward.w1.weight             : rmse 0.00185664, maxerr 0.03087639, 95pct<0.0034, median<0.0016
q4_0::layers.3.feed_forward.w2.weight             : rmse 0.00171196, maxerr 0.05057278, 95pct<0.0032, median<0.0014
q4_0::layers.3.feed_forward.w3.weight             : rmse 0.00170679, maxerr 0.02278137, 95pct<0.0032, median<0.0014
q4_0::layers.30.attention.wk.weight               : rmse 0.00195269, maxerr 0.03295821, 95pct<0.0038, median<0.0016
q4_0::layers.30.attention.wo.weight               : rmse 0.00204545, maxerr 0.05445015, 95pct<0.0038, median<0.0016
q4_0::layers.30.attention.wq.weight               : rmse 0.00194719, maxerr 0.04063878, 95pct<0.0036, median<0.0016
q4_0::layers.30.attention.wv.weight               : rmse 0.00202005, maxerr 0.01512921, 95pct<0.0038, median<0.0016
q4_0::layers.30.feed_forward.w1.weight            : rmse 0.00191074, maxerr 0.02958679, 95pct<0.0036, median<0.0016
q4_0::layers.30.feed_forward.w2.weight            : rmse 0.00191046, maxerr 0.14257812, 95pct<0.0034, median<0.0016
q4_0::layers.30.feed_forward.w3.weight            : rmse 0.00189492, maxerr 0.04852676, 95pct<0.0034, median<0.0016
q4_0::layers.31.attention.wk.weight               : rmse 0.00201812, maxerr 0.02451627, 95pct<0.0038, median<0.0016
q4_0::layers.31.attention.wo.weight               : rmse 0.00184503, maxerr 0.11907780, 95pct<0.0034, median<0.0014
q4_0::layers.31.attention.wq.weight               : rmse 0.00197563, maxerr 0.02724165, 95pct<0.0038, median<0.0016
q4_0::layers.31.attention.wv.weight               : rmse 0.00182399, maxerr 0.01841706, 95pct<0.0034, median<0.0014
q4_0::layers.31.feed_forward.w1.weight            : rmse 0.00199676, maxerr 0.03135899, 95pct<0.0036, median<0.0016
q4_0::layers.31.feed_forward.w2.weight            : rmse 0.00191905, maxerr 0.11260986, 95pct<0.0036, median<0.0016
q4_0::layers.31.feed_forward.w3.weight            : rmse 0.00197545, maxerr 0.04486084, 95pct<0.0036, median<0.0016
q4_0::layers.4.attention.wk.weight                : rmse 0.00252572, maxerr 0.03471547, 95pct<0.0050, median<0.0018
q4_0::layers.4.attention.wo.weight                : rmse 0.00133709, maxerr 0.05675527, 95pct<0.0026, median<0.0012
q4_0::layers.4.attention.wq.weight                : rmse 0.00250660, maxerr 0.04748535, 95pct<0.0048, median<0.0018
q4_0::layers.4.attention.wv.weight                : rmse 0.00133764, maxerr 0.01021584, 95pct<0.0026, median<0.0012
q4_0::layers.4.feed_forward.w1.weight             : rmse 0.00188008, maxerr 0.03756605, 95pct<0.0034, median<0.0016
q4_0::layers.4.feed_forward.w2.weight             : rmse 0.00170612, maxerr 0.04783656, 95pct<0.0032, median<0.0014
q4_0::layers.4.feed_forward.w3.weight             : rmse 0.00171322, maxerr 0.03393555, 95pct<0.0032, median<0.0014
q4_0::layers.5.attention.wk.weight                : rmse 0.00238210, maxerr 0.03174898, 95pct<0.0046, median<0.0018
q4_0::layers.5.attention.wo.weight                : rmse 0.00135344, maxerr 0.04260254, 95pct<0.0026, median<0.0012
q4_0::layers.5.attention.wq.weight                : rmse 0.00236603, maxerr 0.04248789, 95pct<0.0046, median<0.0018
q4_0::layers.5.attention.wv.weight                : rmse 0.00136147, maxerr 0.01390839, 95pct<0.0026, median<0.0012
q4_0::layers.5.feed_forward.w1.weight             : rmse 0.00191865, maxerr 0.03069225, 95pct<0.0036, median<0.0016
q4_0::layers.5.feed_forward.w2.weight             : rmse 0.00168901, maxerr 0.04306030, 95pct<0.0032, median<0.0014
q4_0::layers.5.feed_forward.w3.weight             : rmse 0.00170621, maxerr 0.02728271, 95pct<0.0032, median<0.0014
q4_0::layers.6.attention.wk.weight                : rmse 0.00243652, maxerr 0.02662471, 95pct<0.0048, median<0.0018
q4_0::layers.6.attention.wo.weight                : rmse 0.00136724, maxerr 0.06586111, 95pct<0.0026, median<0.0012
q4_0::layers.6.attention.wq.weight                : rmse 0.00238362, maxerr 0.04891968, 95pct<0.0046, median<0.0018
q4_0::layers.6.attention.wv.weight                : rmse 0.00137151, maxerr 0.01011706, 95pct<0.0026, median<0.0012
q4_0::layers.6.feed_forward.w1.weight             : rmse 0.00189249, maxerr 0.04025269, 95pct<0.0036, median<0.0016
q4_0::layers.6.feed_forward.w2.weight             : rmse 0.00170650, maxerr 0.04733700, 95pct<0.0032, median<0.0014
q4_0::layers.6.feed_forward.w3.weight             : rmse 0.00172804, maxerr 0.02388000, 95pct<0.0032, median<0.0014
q4_0::layers.7.attention.wk.weight                : rmse 0.00236151, maxerr 0.02808842, 95pct<0.0046, median<0.0018
q4_0::layers.7.attention.wo.weight                : rmse 0.00139671, maxerr 0.03219414, 95pct<0.0026, median<0.0012
q4_0::layers.7.attention.wq.weight                : rmse 0.00234200, maxerr 0.04630763, 95pct<0.0046, median<0.0018
q4_0::layers.7.attention.wv.weight                : rmse 0.00141431, maxerr 0.01132994, 95pct<0.0026, median<0.0012
q4_0::layers.7.feed_forward.w1.weight             : rmse 0.00187467, maxerr 0.03225514, 95pct<0.0034, median<0.0016
q4_0::layers.7.feed_forward.w2.weight             : rmse 0.00171376, maxerr 0.04287861, 95pct<0.0032, median<0.0014
q4_0::layers.7.feed_forward.w3.weight             : rmse 0.00173478, maxerr 0.02548218, 95pct<0.0032, median<0.0014
q4_0::layers.8.attention.wk.weight                : rmse 0.00230324, maxerr 0.02782059, 95pct<0.0046, median<0.0016
q4_0::layers.8.attention.wo.weight                : rmse 0.00139008, maxerr 0.03478485, 95pct<0.0026, median<0.0012
q4_0::layers.8.attention.wq.weight                : rmse 0.00230350, maxerr 0.04038759, 95pct<0.0046, median<0.0016
q4_0::layers.8.attention.wv.weight                : rmse 0.00140037, maxerr 0.01309800, 95pct<0.0026, median<0.0012
q4_0::layers.8.feed_forward.w1.weight             : rmse 0.00187512, maxerr 0.03340167, 95pct<0.0034, median<0.0016
q4_0::layers.8.feed_forward.w2.weight             : rmse 0.00171484, maxerr 0.03771973, 95pct<0.0032, median<0.0014
q4_0::layers.8.feed_forward.w3.weight             : rmse 0.00173956, maxerr 0.02236661, 95pct<0.0032, median<0.0014
q4_0::layers.9.attention.wk.weight                : rmse 0.00224144, maxerr 0.02832547, 95pct<0.0044, median<0.0016
q4_0::layers.9.attention.wo.weight                : rmse 0.00137911, maxerr 0.03645405, 95pct<0.0026, median<0.0012
q4_0::layers.9.attention.wq.weight                : rmse 0.00222848, maxerr 0.04025269, 95pct<0.0044, median<0.0016
q4_0::layers.9.attention.wv.weight                : rmse 0.00139053, maxerr 0.01049893, 95pct<0.0026, median<0.0012
q4_0::layers.9.feed_forward.w1.weight             : rmse 0.00184797, maxerr 0.04046337, 95pct<0.0034, median<0.0016
q4_0::layers.9.feed_forward.w2.weight             : rmse 0.00172856, maxerr 0.04580688, 95pct<0.0032, median<0.0014
q4_0::layers.9.feed_forward.w3.weight             : rmse 0.00175109, maxerr 0.04849243, 95pct<0.0032, median<0.0014
q4_0::output.weight                               : rmse 0.00165429, maxerr 0.02467346, 95pct<0.0032, median<0.0014
q4_0::tok_embeddings.weight                       : rmse 0.00163455, maxerr 0.01590976, 95pct<0.0030, median<0.0014
q4_0                                              : rmse 0.00184913, maxerr 0.14257812, 95pct<0.0034, median<0.0014

sw (Collaborator, Author) commented Apr 8, 2023

Now that the statistics tool has landed in master, I've rebased my branch and updated the tool to accept an --implementation argument instead of --reference.

@unbounded: I will definitely have a look at your approach, thanks a lot.

edit: pulled in your commit and updated the stats tool. It is indeed slow ;-). 80% of the time is spent in qsort, so vectorizing with AVX2 isn't going to help much.

quantize.cpp still uses my simple method.

unbounded and others added 2 commits April 8, 2023 10:38
Use a sweep line approach to scan all configurations of quantization,
examining every changeover point where a quantize value changes,
and find the optimal scaling for each configuration analytically.
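The observation behind this (a sketch of the idea, not the PR's code): once the integer assignment q[i] of each weight is fixed, the squared error sum_i (x[i] - d*q[i])^2 is quadratic in d and is minimized in closed form, so the search only has to visit the finitely many scales where some q[i] changes.

#include <cstdint>

// Least-squares optimal scale for a fixed assignment q[i] of a block x[i]:
//   d* = sum(x[i]*q[i]) / sum(q[i]*q[i])
static float optimal_scale(const float * x, const int8_t * q, int n) {
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < n; i++) {
        num += x[i] * (float) q[i];
        den += (float) q[i] * (float) q[i];
    }
    return den > 0.0f ? num / den : 0.0f;
}
// The sweep sorts the changeover points (scales at which some x[i] starts
// rounding to the next integer) and evaluates this closed form once per
// interval, keeping the configuration with the lowest error.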
@@ -1,7 +1,11 @@
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
0cc0b0a3dc8cd29f005946f8364ac2bbce797e792a40c0fb4114615e4f825976 models/7B/ggml-model-f16.bin
5dec1979849d73e361a8bcc10bc8f53237cbbe435a572882dc87629e011e24b3 models/7B/ggml-model-q4_0.bin
Collaborator:
Could you please remove the quantized models from this file? Everyone would have their own unique quantized models.

sw (Collaborator, Author) replied Apr 9, 2023:

The idea would be that model generation is deterministic across platforms and SIMD optimizations, so the files should be identical. Of course, if you keep your old Q4_0 files without updating to minor version 1, the checksum won't match. I might remove it for this PR, but in the long term I think it's a good idea to ensure everyone uses the same inputs.

Collaborator:
OK, I have generated a new quantized model and the checksum matches yours.

ivanstepanovftw (Collaborator) replied Apr 9, 2023:

Sorry, is this checksum for a q4_0 file that has no minor version yet?

Edit: Oh, I see, it's for minor v1. 4 bytes longer than the previous version 😅

@@ -644,7 +644,7 @@ static bool llama_model_load(
     size_t total_size = 0;
     model.n_loaded = 0;

-    while (true) {
+    while (size_t(fin.tellg()) + 12 < file_size) {
Collaborator:
I'd rather do

int offset = 0;
...
offset += sizeof(total_size) + sizeof(model.n_loaded)

sw (Collaborator, Author) replied Apr 9, 2023:

total_size and model.n_loaded are not written or read from the file, so I don't understand why you would use their sizeof.

I admit that the + 12 could be written better. It is intended to be sizeof(n_dims) + sizeof(length) + sizeof(ftype), the next three elements being read.
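A small rewrite that would make that intent explicit might look like this (illustrative; the constant name is made up):

// The next three header fields are one int32_t each: n_dims, length, ftype.
static const size_t k_tensor_header_bytes = 3 * sizeof(int32_t); // == 12
while (size_t(fin.tellg()) + k_tensor_header_bytes < file_size) {
    // ... read n_dims, length, ftype and the tensor data ...
}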

    } else {
        fprintf(stderr, "error: %s not in list of implementations\n", argv[i]);
        invalid_param = true;
    }
} else if (arg == "-v") {
ivanstepanovftw (Collaborator) commented Apr 9, 2023:
Could you please add || --verbose?
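Presumably something like the following in the argument loop shown above (a sketch of the requested change):

} else if (arg == "-v" || arg == "--verbose") {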

ivanstepanovftw (Collaborator) commented Apr 9, 2023

Initial perplexity test.
q4_0, MINOR 0, w/ BLAS (OpenBLAS):

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
74.45 seconds per pass - ETA 13.55 hours
[1]4.3797,[2]4.9554,^C

q4_0, MINOR 0, w/o BLAS:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
26.22 seconds per pass - ETA 4.77 hours
[1]4.5741,[2]5.0601,^C

Commit 678e138 (shown as 7B_q4_0_1 in plot below)
q4_0, MINOR 1, w/o BLAS:

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
26.93 seconds per pass - ETA 4.90 hours
[1]4.7137,[2]5.2331,
...

Final score [655]6.5655.
(full log attached: 7B_q4_0_1.txt)

[plots: perplexity vs. model, full range and zoomed in]

ivanstepanovftw (Collaborator) commented:

Leaving another comment to let you know the final perplexity: [655]6.5655. See the perplexity discussion for the previous results.

sw (Collaborator, Author) commented Apr 10, 2023

@ivanstepanovftw Thanks for your effort. The first few values match mine exactly, so I'll trust your results. It's good to see at least a small improvement.

But as I said in #397, maybe the RMSE of the quantization is a distraction. This method leads to a mean scale value of 8.092, so the maximum value of a block will often be clipped. I would like to see us experiment with #729, but with more (larger) scale values instead of just 7 or 8.

ggerganov linked an issue on Apr 14, 2023 that may be closed by this pull request: Investigate alternative approach for Q4 quantization
mofosyne added labels on May 10, 2024: "Less than 4 bits" (efforts related to viable quantized models using <4 bits), "Review Complexity: High" (generally requires in-depth knowledge of LLMs or GPUs), "enhancement" (new feature or request)