
Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors #995

Closed
2 of 3 tasks
ggerganov opened this issue Apr 15, 2023 · 1 comment
Labels
help wanted Extra attention is needed high priority Very important issue research 🔬

Comments

ggerganov (Owner) commented Apr 15, 2023

The current Q4_0 uses a single F32 floating-point scaling factor.

An idea was proposed by @ikawrakow to change this to 2x F16 factors instead of 1x F32: 679e1cb
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.

The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether the approach is viable. Depending on the results, we may update the current Q4_0 data format and potentially drop support for Q4_1.
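For reference, here is a minimal sketch of how the two block layouts might compare in memory. The `block_q4_0` struct roughly matches the ggml layout at the time (a single F32 scale plus 32 packed 4-bit quants). The `block_q4_0_2f16` struct, its `d0`/`d1` fields, and the per-half-scale reading of the "2x F16" idea are assumptions for illustration, not the layout from 679e1cb. Both variants occupy 20 bytes per 32 weights, so the finer-grained scaling would come at no size cost.

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // IEEE 754 half, as ggml stores it

#define QK 32                   // weights per quantization block

// Current Q4_0: one F32 scale for the whole block.
// Size: 4 + 16 = 20 bytes per 32 weights.
typedef struct {
    float   d;                  // scaling factor
    uint8_t qs[QK / 2];         // 32 x 4-bit quants, two per byte
} block_q4_0;

// Sketch of the proposed variant: two F16 scales, one per 16-weight half.
// Size: 2 + 2 + 16 = 20 bytes per 32 weights -- same as above, but each
// half of the block is scaled independently, which should lower the
// quantization error. Names are hypothetical, not from the linked commit.
typedef struct {
    ggml_fp16_t d0;             // scale for quants  0..15
    ggml_fp16_t d1;             // scale for quants 16..31
    uint8_t     qs[QK / 2];
} block_q4_0_2f16;
```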

SIMD implementation progress

  • ARM NEON
  • AVX
  • WASM

I plan to work on the ARM NEON implementation.
If you want to help with any of the implementations, open a PR with your implementation and results, summarizing the inference speed and the obtained perplexity.

Related

ggerganov (Owner, Author) commented:

This approach resulted in the new Q4_2 and Q4_3 formats, which improve the perplexity results while maintaining inference speeds similar to the original Q4_0 and Q4_1 approaches.

The remaining bits and pieces needed to complete this task will be summarized, together with other follow-ups, in a separate issue.

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
F16_KV appears to have been removed here: ggerganov@af99c6f

This addresses two issues:

 - ggerganov#995, which just requests adding the KV cache offloading param
 - ggerganov#1006, a NULL ptr exception when using the embeddings (introduced by
   leaving f16_kv in the fields struct)