
Quantization does not write the quantization version to ftype #1590

Closed
philpax opened this issue May 25, 2023 · 13 comments
Labels: good first issue, high priority

Comments

philpax commented May 25, 2023

Expected Behavior

When quantizing with llama.cpp, the quantization version should be written to the ftype in the hyperparameters.

Current Behavior

An ftype is produced by llama_model_quantize_internal and is passed through as-is to llama_file_saver, which writes it to disk without encoding it using GGML_QNT_VERSION:

llama.cpp/llama.cpp

Lines 2052 to 2068 in ac7876a

ggml_type quantized_type;
switch (ftype) {
    case LLAMA_FTYPE_MOSTLY_Q4_0: quantized_type = GGML_TYPE_Q4_0; break;
    case LLAMA_FTYPE_MOSTLY_Q4_1: quantized_type = GGML_TYPE_Q4_1; break;
    case LLAMA_FTYPE_MOSTLY_Q5_0: quantized_type = GGML_TYPE_Q5_0; break;
    case LLAMA_FTYPE_MOSTLY_Q5_1: quantized_type = GGML_TYPE_Q5_1; break;
    case LLAMA_FTYPE_MOSTLY_Q8_0: quantized_type = GGML_TYPE_Q8_0; break;
    default: throw format("invalid output file type %d\n", ftype);
};
if (nthread <= 0) {
    nthread = std::thread::hardware_concurrency();
}
std::unique_ptr<llama_model_loader> model_loader(new llama_model_loader(fname_inp, /*use_mmap*/ false,
                                                                        /*vocab_only*/ false));
llama_file_saver file_saver(fname_out.c_str(), model_loader->file_loaders.at(0).get(), ftype);

file.write_u32(new_ftype);

Loaders that expect the quantization version, such as llm, therefore detect a quantization version of 0:

     Running `target/release/llm llama info -m models/llama/7B/koala-7B.ggmlv3.q5_1.bin`
[2023-05-25T00:10:05Z INFO  llm] Container type: Ggjt(3)
[2023-05-25T00:10:05Z INFO  llm] Hyperparameters: Hyperparameters { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, file_type: FileType { format: MostlyQ5_1, quantization_version: 0 } }
[2023-05-25T00:10:05Z INFO  llm] Vocabulary size: 32000

Environment and Context

This was reproduced on ac7876a. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.

Steps to Reproduce


  1. make
  2. ./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0
  3. Check the ftype in the written hyperparameters (see the header-dump sketch below).
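
For step 3, a minimal header dump is enough to see the problem. This is only a sketch: the field order follows the ggjt header as llama.cpp reads it, and 1000 is the value of GGML_QNT_VERSION_FACTOR.

// Dump the ggjt header of a quantized model and split the ftype field into
// file type and quantization version.
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    // magic, version, n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype
    uint32_t h[9];
    if (fread(h, sizeof(uint32_t), 9, f) != 9) { fclose(f); return 1; }
    printf("magic=0x%08x version=%u ftype=%u qnt_version=%u\n",
           h[0], h[1], h[8] % 1000, h[8] / 1000);
    fclose(f);
    return 0;
}
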
mgroeber9110 (Contributor) commented May 28, 2023

Looking at llm, it seems that the intended format for ftype is

ftype = new_type + GGML_QNT_VERSION*GGML_QNT_VERSION_FACTOR

As far as I can see, this also requires updating the loader in llama.cpp (read_hparams), as this currently does not strip the quantization version from the ftype:

hparams.ftype = (enum llama_ftype) file.read_u32();

So, I think this would need to be treated as a breaking change, as the code currently neither writes nor expects to read GGML_QNT_VERSION - right now, GGML_QNT_VERSION_FACTOR is completely ignored.

Should read_hparams throw an exception if the quantization_version in the file being loaded does not match GGML_QNT_VERSION of the current code?
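
A rough sketch of what that could look like on both sides (illustration only; the exact placement and the error handling are my assumptions, not existing code):

// Writer side (e.g. llama_file_saver::write_hparams): encode the current
// quantization version into the stored ftype, as the ggml examples do.
file.write_u32((uint32_t) new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);

// Reader side (read_hparams): strip the version back out and optionally
// reject files quantized with an incompatible version.
const uint32_t stored      = file.read_u32();
const uint32_t qnt_version = stored / GGML_QNT_VERSION_FACTOR;
hparams.ftype = (enum llama_ftype) (stored % GGML_QNT_VERSION_FACTOR);
if (qnt_version != GGML_QNT_VERSION) {
    throw format("unsupported quantization version %u (expected %u)\n",
                 qnt_version, (uint32_t) GGML_QNT_VERSION);
}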

philpax (Author) commented May 28, 2023

What we did in llm was to treat GGJTv2 as QNT1 and GGJTv3 as QNT2:
https://github.com/rustformers/llm/blob/main/crates/llm-base/src/loader.rs#L386-L394

The llama.cpp loader can make the same assumption and behave correctly from now on (including raising an error if the versions don't match). The only point of concern is that other llama.cpp-based loaders aren't going to modulo the ftype until they update, so they'll read an invalid ftype.
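
In llama.cpp terms, that assumption could look something like this (sketch only; the helper and its fallback behaviour are hypothetical, not anything that exists today):

// Map the ggjt container version to the quantization version for files
// written before the version was recorded anywhere else.
static uint32_t llama_qnt_version_from_file_version(uint32_t file_version) {
    switch (file_version) {
        case 1:  return 0;                // GGJTv1 and earlier predate the versioning
        case 2:  return 1;                // GGJTv2 == QNT1
        case 3:  return 2;                // GGJTv3 == QNT2
        default: return GGML_QNT_VERSION; // assumption: newer containers are current
    }
}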

In practice, I'm not sure if this would be an issue; iirc the ftype isn't actually used much, as the tensors carry their own type and ftype isn't used to decide anything important. (I could be wrong! I haven't looked into this recently)

@LostRuins would koboldcpp have any issues with this? Does it attach any existing semantic meaning to the ftype?

LostRuins (Collaborator) commented May 28, 2023

@philpax No, there are no issues with determining the ftype in the file for me so far.

Modulo for ftype is only required for ggml magic files (0x67676d6c), not ggjt (0x67676a74); for ggjt I just read the magic, ftype, and file version directly.

This is because ggjt stores the magic, version (LLAMA_FILE_VERSION), and ftype as separate fields https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L545, whereas ggml multiplexes ftype and version into the same field. Ref: ggerganov/ggml#150 (comment)

So you should not be using GGML_QNT_VERSION_FACTOR at all for ggjt; it is meant for ggml.

My implementation: https://github.com/LostRuins/koboldcpp/blob/concedo/model_adapter.cpp#L220
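
To make the difference concrete, the two read paths look roughly like this (a sketch, not verbatim from either codebase; the function names are invented):

// ggml-magic files (0x67676d6c): there is no separate version field, so the
// quantization version is multiplexed into the ftype field.
static void read_ggml_style_header(llama_file & file) {
    const uint32_t magic       = file.read_u32(); // 0x67676d6c
    // ... hyperparameters ...
    const uint32_t ftype_field = file.read_u32();
    const uint32_t qnt_version = ftype_field / GGML_QNT_VERSION_FACTOR;
    const uint32_t ftype       = ftype_field % GGML_QNT_VERSION_FACTOR;
}

// ggjt files (0x67676a74): magic and file version (LLAMA_FILE_VERSION) are
// separate header fields, and ftype has so far been stored as a plain value.
static void read_ggjt_style_header(llama_file & file) {
    const uint32_t magic   = file.read_u32(); // 0x67676a74
    const uint32_t version = file.read_u32(); // LLAMA_FILE_VERSION
    // ... hyperparameters, ending with ...
    const uint32_t ftype   = file.read_u32(); // no GGML_QNT_VERSION encoding today
}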

philpax (Author) commented May 28, 2023

I suspect that wasn't an intentional design choice, as GGML_QNT_VERSION_FACTOR is used for all of the other model implementations and there's nothing stopping the use of GGJTv2/3 with those models (llm can load both GGML and GGJTv3). @ggerganov can you confirm what the ideal behaviour should be here?

(If this leads to a format break, which I hope it doesn't, or at least one that's seamless, I'd strongly suggest splitting the ftype and quantization version into separate fields; there's no reason to multiplex them if strict compatibility isn't necessary.)

ggerganov (Owner) commented May 28, 2023

The explanation by @LostRuins is correct: for the ggjt format used by LLaMA derivatives, the version is stored separately.
For the ggml format, there was no version field, so adding one would have been a breaking change. Therefore, we decided to encode GGML_QNT_VERSION into the ftype field by multiplying it by GGML_QNT_VERSION_FACTOR.

We can do the same encoding for the ftype field in ggjt in order to be consistent in the way we store ftype.
This will not be a breaking change.

But adding a separate version field to ggml would be a breaking change. I'm OK with that, but I think it's better to avoid it.
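
On the write side, doing the same encoding for ggjt would essentially be a one-line change (a sketch of the idea; it assumes the reader is updated to take the modulo at the same time):

// llama_file_saver: store ftype with the quantization version encoded,
// the same way the ggml examples already do.
file.write_u32(new_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR);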

philpax (Author) commented May 28, 2023

Ah okay, that makes sense. Well, given that, it'd be nice to write the quantization version in the ftype, but it doesn't matter too much as we can determine it from the format version.

With that being said, if #1575 (comment) / ggerganov/ggml#147 gets implemented, I'd suggest moving the file format loader/saver to ggml, and sharing the format code, including Python conversion code, between all models including llama.cpp.

This would have a few benefits:

  • All model architectures can then use GGJTv4, and benefit from mmap compatibility
  • The quantization version can be separated as one of the fields
  • There's much less duplicated code floating around: it will all be the same model conversion and quantization code
  • The same format can be used for new architectures, and all existing programs will be able to load that format and tell the user that it's unsupported, instead of guessing

LostRuins (Collaborator) commented May 29, 2023

@philpax I'm personally more in the "if it's not broken, don't fix it" camp, so given that the version+ftype multiplexing was already added to the existing ggml format and is currently working, I'm not really in favor of changing it.

I'm wondering what you mean by "tell the user it's unsupported instead of guessing": if everything uses the same ggjt magic, you still won't be able to tell a NeoX model from a LLaMA-style ggjt model except by maybe binning the vocab ranges.

Edit:

That said, switching from a packed struct to the key-value metadata you suggested in #1575 seems like a good idea if it can be implemented robustly. A few fields should be mandatory for a model to be considered compliant. It would certainly be good to definitively tell apart the different architectures, such as GPT2 and GPTJ.

A final magic change should be considered if this ends up being implemented.

philpax (Author) commented May 29, 2023

Yup, you got it - if we have kv metadata, we can store the architecture in there too, and that should finally solve the issues we've been having around identifying models.

LostRuins (Collaborator) commented

But it has to be consistent. Leaving the standard freely extensible but undefined can rapidly lead to fragmented formats as each implementation adds its own keys. That's how you get monstrosities in modern JavaScript such as

var event = event || window.event; var charCode = event.keyCode || event.which;

where every browser adds its own approach.

The last thing we need is for some models to use "ftype" and others to use "filetype", "quanttype", "qType", or "quantTypeFormat".

Random ideas: should the keys be stored as ASCII strings (fixed length? null-terminated?)? Will it handle Unicode (JSON keys permit that)? What's your opinion on case-insensitive keys?

philpax (Author) commented May 29, 2023

Hm yeah, good points. Okay, how about snake_case ASCII keys with no null termination (the length comes before the string, which I think is consistent with the tensors)? If there's demand for Unicode keys, that can be added later, but I've not seen much of that in the HF configs. The values can store as much UTF-8 as they want, but the keys are constrained to snake_case ASCII.

I don't want to be too prescriptive on key length, but 256 characters max should be more than sufficient. You could maaaybe exceed that if you were specifying parameters for individual tensors and needed to include their full name, but I think that's a problem we'll deal with when we get to it.
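
As a strawman for the on-disk encoding of those keys (purely illustrative; nothing like this exists yet, and the helper names are invented):

// Length-prefixed, non-null-terminated ASCII key, mirroring how tensor
// names are already stored: a u32 length followed by that many bytes.
static void write_key(llama_file & file, const std::string & key) {
    // assumption: snake_case ASCII, at most 256 bytes
    file.write_u32((uint32_t) key.size());
    file.write_raw(key.data(), key.size());
}

static std::string read_key(llama_file & file) {
    const uint32_t len = file.read_u32();
    std::string key(len, 0);
    file.read_raw(&key[0], len);
    return key;
}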

By the way, we're having this discussion across several issues. @ggerganov do you mind if I create one issue on ggml for this proposed new model format, and then we can close this / #1575 / ggerganov/ggml#147 ?

ggerganov (Owner) commented

@philpax
Yes, let's consolidate the discussion into a new ggml issue. Start with the idea from here: #1575 (comment)

ekdnam commented Jun 5, 2023

I would like to solve this. Can anyone please guide me?

ggerganov (Owner) commented

#2398
