Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors #995
Status: Closed (2 of 3 tasks completed)
This approach resulted in the new data format. The remaining bits and pieces to complete this task will be summarized together with other things in a separate issue.
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue on Dec 19, 2023: "F16_KV appears to have been removed here: ggerganov@af99c6f. This addresses two issues: ggerganov#995, which just requests to add the KV cache offloading param, and ggerganov#1006, a NULL ptr exception when using the embeddings (introduced by leaving f16_kv in the fields struct)."
The current Q4_0 uses a single F32 floating-point scaling factor. An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: 679e1cb

Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.

The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can think about updating the current Q4_0 data format and potentially dropping support for Q4_1.

SIMD implementation progress
I plan to work on the ARM NEON implementation.
If you want to help with any of the implementations, propose an implementation + results in a PR, summarizing the inference speed and the perplexity you obtain with it.