
Quantization does not write the quantization version to ftype #1590

Closed
philpax opened this issue May 25, 2023 · 13 comments
Labels: good first issue, high priority

Comments

philpax commented May 25, 2023

Expected Behavior

When quantizing with llama.cpp, the quantization version should be written to the ftype in the hyperparameters.

Current Behavior

An ftype is produced by llama_model_quantize_internal and is passed through as-is to llama_file_saver, which writes it to disk without encoding it using GGML_QNT_VERSION:

llama.cpp/llama.cpp

Lines 2052 to 2068 in ac7876a

ggml_type quantized_type;
switch (ftype) {
    case LLAMA_FTYPE_MOSTLY_Q4_0: quantized_type = GGML_TYPE_Q4_0; break;
    case LLAMA_FTYPE_MOSTLY_Q4_1: quantized_type = GGML_TYPE_Q4_1; break;
    case LLAMA_FTYPE_MOSTLY_Q5_0: quantized_type = GGML_TYPE_Q5_0; break;
    case LLAMA_FTYPE_MOSTLY_Q5_1: quantized_type = GGML_TYPE_Q5_1; break;
    case LLAMA_FTYPE_MOSTLY_Q8_0: quantized_type = GGML_TYPE_Q8_0; break;
    default: throw format("invalid output file type %d\n", ftype);
};
if (nthread <= 0) {
    nthread = std::thread::hardware_concurrency();
}
std::unique_ptr<llama_model_loader> model_loader(new llama_model_loader(fname_inp, /*use_mmap*/ false,
                                                                        /*vocab_only*/ false));
llama_file_saver file_saver(fname_out.c_str(), model_loader->file_loaders.at(0).get(), ftype);

file.write_u32(new_ftype);

Loaders that expect the quantization version, such as llm, therefore detect a quantization version of 0:

     Running `target/release/llm llama info -m models/llama/7B/koala-7B.ggmlv3.q5_1.bin`
[2023-05-25T00:10:05Z INFO  llm] Container type: Ggjt(3)
[2023-05-25T00:10:05Z INFO  llm] Hyperparameters: Hyperparameters { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, file_type: FileType { format: MostlyQ5_1, quantization_version: 0 } }
[2023-05-25T00:10:05Z INFO  llm] Vocabulary size: 32000

Environment and Context

This was reproduced on ac7876a. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.

Steps to Reproduce


  1. make
  2. ./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0
  3. Check the ftype in the written hyperparameters (see the header-dump sketch below).
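
For step 3, a minimal header dump is enough to see the problem. This is only a sketch: the field order follows the ggjt header as llama.cpp reads it, and 1000 is the value of GGML_QNT_VERSION_FACTOR.

// Dump the ggjt header of a quantized model and split the ftype field into
// file type and quantization version.
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    // magic, version, n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype
    uint32_t h[9];
    if (fread(h, sizeof(uint32_t), 9, f) != 9) { fclose(f); return 1; }
    printf("magic=0x%08x version=%u ftype=%u qnt_version=%u\n",
           h[0], h[1], h[8] % 1000, h[8] / 1000);
    fclose(f);
    return 0;
}
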
mgroeber9110 (Contributor) commented May 28, 2023

Looking at llm, it seems that the intended format for ftype is

ftype = new_type + GGML_QNT_VERSION*GGML_QNT_VERSION_FACTOR

As far as I can see, this also requires updating the loader in llama.cpp (read_hparams), as this currently does not strip the quantization version from the ftype:

hparams.ftype = (enum llama_ftype) file.read_u32();

So, I think this would need to be treated as a breaking change, as the code currently neither writes nor expects to read GGML_QNT_VERSION - right now, GGML_QNT_VERSION_FACTOR is completely ignored.

Should read_hparams throw an exception if the quantization_version in the file being loaded does not match GGML_QNT_VERSION of the current code?
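
A rough sketch of what that could look like on both sides (illustration only; the exact placement and the error handling are my assumptions, not existing code):

// Writer side (e.g. llama_file_saver::write_hparams): encode the current
// quantization version into the stored ftype, as the ggml examples do.
file.write_u32((uint32_t) new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);

// Reader side (read_hparams): strip the version back out and optionally
// reject files quantized with an incompatible version.
const uint32_t stored      = file.read_u32();
const uint32_t qnt_version = stored / GGML_QNT_VERSION_FACTOR;
hparams.ftype = (enum llama_ftype) (stored % GGML_QNT_VERSION_FACTOR);
if (qnt_version != GGML_QNT_VERSION) {
    throw format("unsupported quantization version %u (expected %u)\n",
                 qnt_version, (uint32_t) GGML_QNT_VERSION);
}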

philpax (Author) commented May 28, 2023

What we did in llm was to treat GGJTv2 as QNT1 and GGJTv3 as QNT2:
https://github.com/rustformers/llm/blob/main/crates/llm-base/src/loader.rs#L386-L394

The llama.cpp loader can make the same assumption and behave correctly from now on (including raising an error if the versions don't match). The only point of concern is that other llama.cpp-based loaders aren't going to modulo the ftype until they update, so they'll read an invalid ftype.
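
In llama.cpp terms, that assumption could look something like this (sketch only; the helper and its fallback behaviour are hypothetical, not anything that exists today):

// Map the ggjt container version to the quantization version for files
// written before the version was recorded anywhere else.
static uint32_t llama_qnt_version_from_file_version(uint32_t file_version) {
    switch (file_version) {
        case 1:  return 0;                // GGJTv1 and earlier predate the versioning
        case 2:  return 1;                // GGJTv2 == QNT1
        case 3:  return 2;                // GGJTv3 == QNT2
        default: return GGML_QNT_VERSION; // assumption: newer containers are current
    }
}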

In practice, I'm not sure if this would be an issue; iirc the ftype isn't actually used much, as the tensors carry their own type and ftype isn't used to decide anything important. (I could be wrong! I haven't looked into this recently)

@LostRuins would koboldcpp have any issues with this? Does it attach any existing semantic meaning to the ftype?

LostRuins (Collaborator) commented May 28, 2023

@philpax No, there are no issues with determining the ftype in the file for me so far.

Modulo for ftype is only required for ggml magic files (0x67676d6c), not ggjt (0x67676a74); for ggjt I just read the magic, ftype, and file version directly.

This is because ggjt stores the magic, version (LLAMA_FILE_VERSION), and ftype as separate fields https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L545, whereas ggml multiplexes ftype and version into the same field. Ref: ggerganov/ggml#150 (comment)

So you should not be using GGML_QNT_VERSION_FACTOR at all for ggjt; it is meant for ggml.

My implementation: https://github.com/LostRuins/koboldcpp/blob/concedo/model_adapter.cpp#L220
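
To make the difference concrete, the two read paths look roughly like this (a sketch, not verbatim from either codebase; the function names are invented):

// ggml-magic files (0x67676d6c): there is no separate version field, so the
// quantization version is multiplexed into the ftype field.
static void read_ggml_style_header(llama_file & file) {
    const uint32_t magic       = file.read_u32(); // 0x67676d6c
    // ... hyperparameters ...
    const uint32_t ftype_field = file.read_u32();
    const uint32_t qnt_version = ftype_field / GGML_QNT_VERSION_FACTOR;
    const uint32_t ftype       = ftype_field % GGML_QNT_VERSION_FACTOR;
}

// ggjt files (0x67676a74): magic and file version (LLAMA_FILE_VERSION) are
// separate header fields, and ftype has so far been stored as a plain value.
static void read_ggjt_style_header(llama_file & file) {
    const uint32_t magic   = file.read_u32(); // 0x67676a74
    const uint32_t version = file.read_u32(); // LLAMA_FILE_VERSION
    // ... hyperparameters, ending with ...
    const uint32_t ftype   = file.read_u32(); // no GGML_QNT_VERSION encoding today
}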

philpax (Author) commented May 28, 2023

I suspect that wasn't an intentional design choice, as GGML_QNT_VERSION_FACTOR is used for all of the other model implementations and there's nothing stopping the use of GGJTv2/3 with those models (llm can load both GGML and GGJTv3). @ggerganov can you confirm what the ideal behaviour should be here?

(If this leads to a format break, which I hope it doesn't, or at least one that's seamless, I'd strongly suggest splitting the ftype and quantization version into separate fields; there's no reason to multiplex them if strict compatibility isn't necessary.)

ggerganov (Owner) commented May 28, 2023

The explanation by @LostRuins is correct: for the ggjt format used by LLaMA derivatives, the version is stored separately.
For the ggml format, there was no version field, so adding one would have been a breaking change. Therefore, we decided to encode GGML_QNT_VERSION into the ftype field by multiplying it by GGML_QNT_VERSION_FACTOR.

We can do the same encoding for the ftype field in ggjt in order to be consistent in the way we store ftype.
This will not be a breaking change.

But adding a separate version field to ggml would be a breaking change. I'm OK with that, but I think it's better to avoid it.
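
On the write side, doing the same encoding for ggjt would essentially be a one-line change (a sketch of the idea; it assumes the reader is updated to take the modulo at the same time):

// llama_file_saver: store ftype with the quantization version encoded,
// the same way the ggml examples already do.
file.write_u32(new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);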

philpax (Author) commented May 28, 2023

Ah okay, that makes sense. Well, given that, it'd be nice to write the quantization version in the ftype, but it doesn't matter too much as we can determine it from the format version.

With that being said, if #1575 (comment) / ggerganov/ggml#147 gets implemented, I'd suggest moving the file format loader/saver to ggml, and sharing the format code, including Python conversion code, between all models including llama.cpp.

This would have a few benefits:

  • All model architectures can then use GGJTv4, and benefit from mmap compatibility
  • The quantization version can be separated as one of the fields
  • There's much less duplicated code floating around: it will all be the same model conversion and quantization code
  • The same format can be used for new architectures, and all existing programs will be able to load that format and tell the user that it's unsupported, instead of guessing

LostRuins (Collaborator) commented May 29, 2023

@philpax I'm personally more in the "if it's not broken, don't fix it" camp, so given that the version+ftype multiplexing was already added to the existing ggml format and is currently working, I'm not really in favor of changing it.

I'm wondering what you mean by "tell the user it's unsupported instead of guessing": if everything uses the same ggjt magic, you still won't be able to tell a NeoX model from a LLaMA-style ggjt model except by maybe binning the vocab ranges.

Edit:

That said, switching from a packed struct to the key-value metadata you suggested in #1575 seems like a good idea if it can be implemented robustly. A few fields should be mandatory for a model to be considered compliant. It would certainly be good to definitively tell apart the different architectures, such as GPT2 and GPTJ.

A final magic change should be considered if this ends up being implemented.

philpax (Author) commented May 29, 2023

Yup, you got it - if we have kv metadata, we can store the architecture in there too, and that should finally solve the issues we've been having around identifying models.

LostRuins (Collaborator) commented

But it has to be consistent. Leaving the standard freely extensible but undefined can rapidly lead to fragmented formats as each implementation adds its own keys. That's how you get monstrosities in modern JavaScript such as

var event = event || window.event; var charCode = event.keyCode || event.which;

where every browser adds its own approach.

The last thing we need is for some models to use "ftype" and others to use "filetype", "quanttype", "qType", or "quantTypeFormat".

Random ideas: should the keys be stored as ASCII strings (fixed length? null-terminated?)? Will it handle Unicode (JSON keys permit that)? What's your opinion on case-insensitive keys?

philpax (Author) commented May 29, 2023

Hm yeah, good points. Okay, how about snake_case ASCII keys with no null termination (the length comes before the string, which I think is consistent with the tensors)? If there's demand for Unicode keys, that can be added later, but I've not seen much of that in the HF configs. The values can store as much UTF-8 as they want, but the keys are constrained to snake_case ASCII.

I don't want to be too prescriptive on key length, but 256 characters max should be more than sufficient. You could maaaybe exceed that if you were specifying parameters for individual tensors and needed to include their full name, but I think that's a problem we'll deal with when we get to it.
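
As a strawman for the on-disk encoding of those keys (purely illustrative; nothing like this exists yet, and the helper names are invented):

// Length-prefixed, non-null-terminated ASCII key, mirroring how tensor
// names are already stored: a u32 length followed by that many bytes.
static void write_key(llama_file & file, const std::string & key) {
    // assumption: snake_case ASCII, at most 256 bytes
    file.write_u32((uint32_t) key.size());
    file.write_raw(key.data(), key.size());
}

static std::string read_key(llama_file & file) {
    const uint32_t len = file.read_u32();
    std::string key(len, 0);
    file.read_raw(&key[0], len);
    return key;
}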

By the way, we're having this discussion across several issues. @ggerganov do you mind if I create one issue on ggml for this proposed new model format, and then we can close this / #1575 / ggerganov/ggml#147 ?

ggerganov (Owner) commented

@philpax
Yes, let's consolidate the discussion into a new ggml issue. Start with the idea from here: #1575 (comment)

ekdnam commented Jun 5, 2023

I would like to solve this. Can anyone please guide me?

ggerganov (Owner) commented

#2398
