
preemptive request: regarding possible bit shuffling sync (just in case) #150

Open

LostRuins opened this issue May 12, 2023 · 10 comments

@LostRuins
Contributor

LostRuins commented May 12, 2023

Hello, this is just a pre-emptive request. On the chance that the llama.cpp bit shuffling changes are synced to this repo, would it be possible to add some indication to the new models to differentiate them from the old ones already in existence? A new field, a magic change, a version indicator or something similar would be very useful. Perhaps #147 could be considered too?

The reason is that, since the models are otherwise indistinguishable (same file format and structure), it would be hard to tell whether a model file uses bit shuffling or not (old models will load perfectly fine but just generate gibberish).

If the bit shuffling changes are not planned to be upstreamed, then please disregard this issue.

Thanks in advance!

  • Concedo
@ggerganov
Owner

I'm planning to back-port the changes soon.

The way I see it is that users of the ggml library have to implement their own versioning scheme for the models.
The examples in the ggml repo serve just as sample implementations. They are not meant to be used in production.

I'm open to suggestions, but I don't see it as necessary to extend the examples with versioning, since they provide scripts for generating quantized models from scratch using the originally distributed Python model files.

@LostRuins
Contributor Author

Ah, I get that. But I would say that this repo has sort of become the de facto standard, as all implementations I know of are based on the code here. Implementing my own koboldcpp versioning would fracture the ecosystem, since it wouldn't be supported on, for example, Rustformers LLM or llama-cpp-python, and vice versa.

Plus, there are quite a few people who use this GGML repo directly, converting their models here and sharing them for downstream use on HF, various forums, and Discord servers! (You have no idea how popular GGML has become, haha.) There are already hundreds of existing quantized models out there.

Since this is the base repo, I can be reasonably confident that all other integrators will follow whatever versioning approach you take, rather than leaving the decision to many individual downstream actors.

@philpax any thoughts on this?

@henk717

henk717 commented May 13, 2023

Plus one on this. It's basically impossible for end users to reliably differentiate the formats, and the only way we have kept things manageable on the user-support side is by supporting all of them, which this change would allow.

@jebcarter-space

Throwing my support behind this as a user, and as someone who is making the case for deploying the upcoming open-source llama-architecture models in a business environment. Having some kind of versioning on the quantization format will be a big help for support, not just in the near future but long term as branches fall behind.

I know the quantization scripts are available and the base models can be returned to and re-quantized, but not everyone has the technical capacity for that.

Also, of course, offering my gratitude for the development of llamacpp and the democratization of this technology that it is enabling. My "let's see what an AI co-writer that can't be taken from me looks like" project dropped its starting cost by a thousand bucks thanks to CPU inference.

  • Best wishes and good health

@LostRuins
Contributor Author

LostRuins commented May 13, 2023

To get the ball rolling, this is my rough proposal (anyone feel free to chip in or modify!)

Change the file magic from ggml to ggmf (0x67676d66), similar to what llama.cpp did when it started adding versioning.
Then add a 4-byte field after the magic for the file version.

Currently, the existing users of the ggmf magic are llama.cpp (which used file version == 1) and RWKV.cpp (which uses file versions >= 100).
To avoid collisions, I recommend starting the file version at the integer value 1000 and incrementing from there; subsequent breaking changes can use increasing file versions 1001, 1002, and so on. This also avoids clashing with the version numbers already used under this magic by llama.cpp.

Thoughts?
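
For illustration, here is a minimal sketch of how a loader could check the header proposed above. This is hypothetical: the magic value is the one given in the proposal, and the GGML_FILE_VERSION constant and check_header function are names invented for this sketch, not anything that exists in the repo.

    #include <cstdint>
    #include <cstdio>
    #include <fstream>

    // Hypothetical constants following the proposal above.
    static const uint32_t GGMF_MAGIC        = 0x67676d66; // "ggmf"
    static const uint32_t GGML_FILE_VERSION = 1000;       // proposed starting version for ggml models

    // Read and validate the proposed magic + version header.
    bool check_header(std::ifstream & fin) {
        uint32_t magic   = 0;
        uint32_t version = 0;
        fin.read((char *) &magic,   sizeof(magic));
        fin.read((char *) &version, sizeof(version));

        if (magic != GGMF_MAGIC) {
            fprintf(stderr, "bad magic: 0x%08x\n", (unsigned) magic);
            return false;
        }
        // version 1 is used by llama.cpp and versions >= 100 by RWKV.cpp,
        // so under this proposal ggml model files would start at 1000
        if (version < GGML_FILE_VERSION) {
            fprintf(stderr, "unsupported file version: %u\n", (unsigned) version);
            return false;
        }
        return true;
    }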

@ggerganov
Owner

ggerganov commented May 13, 2023

Here is another approach:

  • Do not change magic
  • Change this to GGML_FILE_VERSION 1000
  • Change this for all examples to:
    ftype += GGML_FILE_VERSION;
    fout.write((char *) &ftype, sizeof(hparams.f16));
  • In the loading code, when we read ftype from the header, we divide it by 1000 to get the quantization version; the remainder is the actual quantization type. I.e., the "old" quantized models, as well as the old and new F16 models, will have a quantization version of 0, and the new quantized models will have a version of 1
  • Upon breaking change, we bump GGML_FILE_VERSION by 1000
  • llama.cpp models keep using their current versioning, as they have a different magic anyway

The benefit of this approach is that all existing F16 model files will remain compatible and we don't have to update the existing Python conversion scripts. This will simplify my work, as otherwise I would need to update the F16 ggml and whisper.cpp models that I am hosting, for no reason.
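
To make the multiplexing above concrete, here is a minimal sketch of the write/read sides (assuming GGML_FILE_VERSION is defined as 1000 as proposed; the encode/decode helpers are names used only for this sketch, and the actual examples inline this logic rather than wrapping it in functions):

    #include <cstdint>

    // Proposed constant: bumped by 1000 on each breaking quantization change.
    #define GGML_FILE_VERSION 1000

    // Writer side (quantization tools): fold the version into the existing ftype field.
    int32_t encode_ftype(int32_t ftype) {
        return ftype + GGML_FILE_VERSION;
    }

    // Loader side: split the stored value back into quantization version and type.
    void decode_ftype(int32_t stored, int32_t & qnt_version, int32_t & ftype) {
        qnt_version = stored / 1000; // 0 for old files and all F16 files, 1 for newly quantized files
        ftype       = stored % 1000; // the actual quantization type, unchanged from before
    }

Because the F16 conversion scripts are left untouched, F16 files keep a raw ftype below 1000 and therefore decode to version 0, which is exactly the backwards compatibility described above.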

@LostRuins
Contributor Author

That is pretty clever. Hooray for multiplexing!

@philpax
Contributor

philpax commented May 14, 2023

That would work for us. If the bit shuffling sync changes are brought to this repo, can it be done in such a way that both quantization methods are available?

@ggerganov
Owner

@philpax

No - it would be very difficult to maintain so many SIMD routines.

I will now proceed with implementing the proposed versioning and syncing the changes from llama.cpp.

@ggerganov
Owner

I just added the GGML_QNT_VERSION constant to ggml.h.
This signifies the current quantization format version (currently 0).

When I merge the llama.cpp changes, I will bump this version to 1.
See the examples for how to use this information to determine whether a model file has an old quantization format.
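
For reference, a loader-side check along these lines might look like the following. This is a sketch only: GGML_QNT_VERSION comes from ggml.h as described above, while the factor of 1000, the ftype convention (0 = F32, 1 = F16 in the examples), and the report_qnt_version helper follow the scheme discussed in this thread rather than being copied from the example code.

    #include <cstdint>
    #include <cstdio>

    #include "ggml.h" // provides GGML_QNT_VERSION

    // Split the on-disk ftype into quantization version and type, and warn when a
    // quantized file does not match the version this build of ggml expects.
    void report_qnt_version(int32_t ftype_from_file) {
        const int32_t qntvr = ftype_from_file / 1000; // quantization format version
        const int32_t ftype = ftype_from_file % 1000; // actual quantization type

        printf("ftype = %d, qntvr = %d\n", ftype, qntvr);

        // F32/F16 files are never bit-shuffled, so they legitimately stay at version 0;
        // only quantized files (ftype > 1 in the examples' convention) need to match.
        if (ftype > 1 && qntvr != GGML_QNT_VERSION) {
            fprintf(stderr, "warning: quantization format version %d does not match this build (%d); "
                            "the model likely needs to be re-quantized\n", qntvr, GGML_QNT_VERSION);
        }
    }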

github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue May 14, 2023
== Relevant log messages from source repo:

commit 601a033475645370483973817d987928ea95f36c
Author: Georgi Gerganov <[email protected]>
Date:   Sun May 14 10:20:19 2023 +0300

    ggml : add GGML_QNT_VERSION to track quantization format changes

    ggerganov/ggml#150 (comment)