Hugging Face's bitsandbytes library allows for partial quantization of models by partitioning computation graphs according to an outlier threshold. Their paper indicates that they achieve nearly as much memory compression as standard full quantization with barely any change in perplexity. The tradeoff is a roughly 25% hit to inference speed.
Does ggml have support for this?
https://huggingface.co/blog/hf-bitsandbytes-integration
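For reference, this is a minimal sketch of what enabling it looks like on the Hugging Face side, assuming transformers and bitsandbytes are installed; the model id is only a placeholder, not one mentioned in this issue:

```python
# Sketch: loading a model with bitsandbytes 8-bit (LLM.int8()) quantization
# through Hugging Face transformers. The model id is an example placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder model, assumption

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across available GPU/CPU memory
    load_in_8bit=True,   # int8 weights, with outlier columns kept in fp16
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```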
Maybe I am missing something, but ggml already supports 8-bit quantization as q8_0, and at least with the llama models, the increase in perplexity is very low. Nonetheless, if you implement it and it provides tangible benefits, I think that the chances of it being merged are very high, but ultimately that's up to @ggerganov. It may be better to do it in the llama.cpp repository though, as that's where most of the development of new features is happening currently.
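To make the comparison concrete: q8_0 stores weights in small blocks (32 values per block), each block holding int8 values plus a single scale. The numpy sketch below shows the general idea only; it is not the actual ggml code, and the block size is just the commonly used value:

```python
import numpy as np

def quantize_q8_0_like(x, block_size=32):
    """Block-wise symmetric int8 quantization (q8_0-style sketch, not ggml's code)."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per block
    scales[scales == 0] = 1.0                               # avoid division by zero
    q = np.round(x / scales).astype(np.int8)                # int8 weights
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q8_0_like(w)
print("max abs error:", np.abs(dequantize(q, s).reshape(-1) - w).max())
```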
OK. Good to know. I know that ggml supports int8, but when reviewing the code I didn't see anything for mixed-precision matmuls, which is essentially what HF does when you specify `load_in_8bit=True`.
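For context, the mixed-precision matmul in LLM.int8() decomposes each matrix multiply by feature columns: dimensions whose activation magnitude exceeds the outlier threshold are multiplied in full precision, and the rest go through the int8 path with vector-wise scales. This is a conceptual numpy sketch under those assumptions, not the bitsandbytes implementation:

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """LLM.int8()-style outlier decomposition (conceptual sketch only).

    x: (tokens, features) activations; w: (features, out) weights.
    Feature dimensions with values above `threshold` are multiplied in full
    precision; the rest are quantized to int8 with per-row / per-column
    scales, multiplied as integers, then dequantized.
    """
    outlier_cols = np.abs(x).max(axis=0) > threshold           # outlier feature dims
    x_out, w_out = x[:, outlier_cols], w[outlier_cols, :]      # full-precision path
    x_reg, w_reg = x[:, ~outlier_cols], w[~outlier_cols, :]    # int8 path

    # vector-wise quantization: one scale per row of x, one per column of w
    sx = np.abs(x_reg).max(axis=1, keepdims=True) / 127.0 + 1e-8
    sw = np.abs(w_reg).max(axis=0, keepdims=True) / 127.0 + 1e-8
    xq = np.round(x_reg / sx).astype(np.int8)
    wq = np.round(w_reg / sw).astype(np.int8)

    int8_part = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    fp_part = x_out @ w_out
    return int8_part + fp_part

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
print("max abs error vs fp32 matmul:", np.abs(mixed_precision_matmul(x, w) - x @ w).max())
```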