Hugging Face's bitsandbytes library allows for partial quantization of models by partitioning computation graphs according to an outlier threshold. Their paper indicates that they achieve nearly as much memory compression as standard full quantization with barely any change in perplexity. The tradeoff is a roughly 25% hit to inference speed.
Does ggml have support for this?
https://huggingface.co/blog/hf-bitsandbytes-integration
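For reference, this is a minimal sketch of what enabling it looks like on the Hugging Face side, assuming transformers and bitsandbytes are installed; the model id is only a placeholder, not one mentioned in this issue:

```python
# Sketch: loading a model with bitsandbytes 8-bit (LLM.int8()) quantization
# through Hugging Face transformers. The model id is an example placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder model, assumption

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across available GPU/CPU memory
    load_in_8bit=True,   # int8 weights, with outlier columns kept in fp16
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```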
Maybe I am missing something, but ggml already supports 8-bit quantization as q8_0, and at least with the llama models, the increase in perplexity is very low. Nonetheless, if you implement it and it provides tangible benefits, I think that the chances of it being merged are very high, but ultimately that's up to @ggerganov. It may be better to do it in the llama.cpp repository though, as that's where most of the development of new features is happening currently.
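To make the comparison concrete: q8_0 stores weights in small blocks (32 values per block), each block holding int8 values plus a single scale. The numpy sketch below shows the general idea only; it is not the actual ggml code, and the block size is just the commonly used value:

```python
import numpy as np

def quantize_q8_0_like(x, block_size=32):
    """Block-wise symmetric int8 quantization (q8_0-style sketch, not ggml's code)."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per block
    scales[scales == 0] = 1.0                               # avoid division by zero
    q = np.round(x / scales).astype(np.int8)                # int8 weights
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q8_0_like(w)
print("max abs error:", np.abs(dequantize(q, s).reshape(-1) - w).max())
```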
OK. Good to know. I know that ggml supports int8, but when reviewing the code I didn't see anything for mixed-precision matmuls, which is essentially what HF does when you specify `load_in_8bit=True`.
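For context, the mixed-precision matmul in LLM.int8() decomposes each matrix multiply by feature columns: dimensions whose activation magnitude exceeds the outlier threshold are multiplied in full precision, and the rest go through the int8 path with vector-wise scales. This is a conceptual numpy sketch under those assumptions, not the bitsandbytes implementation:

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """LLM.int8()-style outlier decomposition (conceptual sketch only).

    x: (tokens, features) activations; w: (features, out) weights.
    Feature dimensions with values above `threshold` are multiplied in full
    precision; the rest are quantized to int8 with per-row / per-column
    scales, multiplied as integers, then dequantized.
    """
    outlier_cols = np.abs(x).max(axis=0) > threshold           # outlier feature dims
    x_out, w_out = x[:, outlier_cols], w[outlier_cols, :]      # full-precision path
    x_reg, w_reg = x[:, ~outlier_cols], w[~outlier_cols, :]    # int8 path

    # vector-wise quantization: one scale per row of x, one per column of w
    sx = np.abs(x_reg).max(axis=1, keepdims=True) / 127.0 + 1e-8
    sw = np.abs(w_reg).max(axis=0, keepdims=True) / 127.0 + 1e-8
    xq = np.round(x_reg / sx).astype(np.int8)
    wq = np.round(w_reg / sw).astype(np.int8)

    int8_part = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    fp_part = x_out @ w_out
    return int8_part + fp_part

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
print("max abs error vs fp32 matmul:", np.abs(mixed_precision_matmul(x, w) - x @ w).max())
```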