
Support for quantized zero degradation matrix multiplication for Large Language Models #440

Open
ThePerfectComputer opened this issue Aug 8, 2023 · 4 comments

Comments


ThePerfectComputer commented Aug 8, 2023

The bitsandbytes library, integrated with Hugging Face transformers, allows for partial quantization of models by splitting matrix multiplications according to an outlier threshold: input features whose magnitudes exceed the threshold are computed in fp16, while the rest are computed in int8. Their paper (LLM.int8()) indicates that they achieve nearly as much memory compression as standard full quantization with barely any change in perplexity.

The tradeoff is a roughly 25% hit to inference speed.

Does ggml have support for this?

https://huggingface.co/blog/hf-bitsandbytes-integration
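For reference, the outlier decomposition described in the paper can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual bitsandbytes kernels; the per-row symmetric scaling and the simple per-column outlier test are simplifications:

```python
import numpy as np

def quantize_rowwise(x):
    # Symmetric int8 quantization with one fp32 scale per row.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def mixed_int8_matmul(x, w, threshold=6.0):
    # Columns of x (input features) containing any magnitude above the
    # threshold are kept in full precision; the rest go through int8.
    outlier_cols = np.abs(x).max(axis=0) > threshold
    inlier_cols = ~outlier_cols

    # int8 path for the well-behaved features.
    xq, xs = quantize_rowwise(x[:, inlier_cols])
    wq, ws = quantize_rowwise(w[inlier_cols, :].T)  # one scale per output column
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)
    out = acc * xs * ws.T  # dequantize the int32 accumulator

    # Full-precision path for the outlier features.
    out += x[:, outlier_cols] @ w[outlier_cols, :]
    return out
```

The key point is that the expensive bulk of the matmul runs in int8 while the handful of outlier feature dimensions, which would otherwise blow up the quantization scale, stay in floating point.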

@ThePerfectComputer (Author)

If not, would this be something I could/should work on?

@ThePerfectComputer (Author)

Any thoughts on this? I'm curious if there's interest in supporting this, and if so, perhaps I can take a stab at it.

@slaren (Collaborator)

slaren commented Aug 9, 2023

Maybe I am missing something, but ggml already supports 8-bit quantization as q8_0, and at least with the llama models, the increase in perplexity is very low. Nonetheless, if you implement it and it provides tangible benefits, I think that the chances of it being merged are very high, but ultimately that's up to @ggerganov. It may be better to do it in the llama.cpp repository though, as that's where most of the development of new features is happening currently.
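For context, q8_0 groups values into blocks of 32, each stored as int8 with one scale per block. A minimal round-trip sketch of that scheme (illustrative only; the actual ggml implementation uses a packed struct layout and SIMD kernels, and the scale here is cast to fp16 only to mirror a compact storage format):

```python
import numpy as np

QK8_0 = 32  # block size used by ggml's q8_0 format

def quantize_q8_0(x):
    # One scale per 32-element block: d = amax / 127, q = round(x / d).
    blocks = x.reshape(-1, QK8_0)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0
    q = np.round(blocks / d).astype(np.int8)
    return q, d.astype(np.float16)

def dequantize_q8_0(q, d):
    # Reconstruct fp32 values from int8 quants and per-block scales.
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)
```

Because each block carries its own scale, a single large value only degrades the precision of its own 32-element block rather than the whole tensor, which is part of why q8_0's perplexity impact is small.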

@ThePerfectComputer (Author)

OK, good to know. I know that ggml supports int8, but when reviewing the code I didn't see anything for mixed-precision matmuls, which is essentially what HF does when you specify load_in_8bit=True.
