bert.cpp q4_0 performance degradation from commit abea4b7 #122
Comments
What CPU are you using?
Ryzen 5 5600X on Linux
Hm, strange - I don't see anything there that should affect the AVX path.
I was running it with 8 threads. Is it expected that Q4_1 is faster now? Previously it was the other way around, so I kind of thought Q4_1 was the higher-quality, slower quantization.
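For background on that quality/speed tradeoff, here is a sketch of the two block layouts roughly as they looked in ggml at the time; the block size and field names follow ggml's conventions, but treat the details as illustrative rather than exact:

```c
#include <stdint.h>

#define QK4_0 32
#define QK4_1 32

// Q4_0: one scale per block of 32 weights; values decode as (q - 8) * d.
typedef struct {
    float   d;              // scale
    uint8_t qs[QK4_0 / 2];  // 32 weights packed as 4-bit nibbles
} block_q4_0;

// Q4_1: a scale and a minimum per block; values decode as q * d + m.
// The extra per-block minimum improves quality but adds memory traffic
// and arithmetic, which is why Q4_1 was the slower, higher-quality option.
typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK4_1 / 2];  // 32 weights packed as 4-bit nibbles
} block_q4_1;
```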
Your expectation is correct - I think I know what might be causing the regression:
For LLaMA, the mat mul computation is dominated by the dot products, so the overhead of first quantizing src1 to Q8_0 is negligible. In short, to improve the small-matrix case we have to vectorize quantize_row_q8_0 (ggml.c, Line 1436 in 5f9f1c1), similar to the existing vectorized kernels.
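To make the bottleneck concrete, below is a scalar sketch of what quantizing one row to Q8_0 involves. It assumes the Q8_0 layout of that period (32 values per block sharing a single float scale) and is a simplified reference version, not the exact ggml.c code. Vectorizing it means doing the abs-max search and the scaled rounding with SIMD (AVX/NEON) instead of these scalar loops:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
} block_q8_0;

// Scalar reference: quantize k floats (k must be a multiple of QK8_0).
static void quantize_row_q8_0_ref(const float * x, block_q8_0 * y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        // find the absolute maximum of the block
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; j++) {
            amax = fmaxf(amax, fabsf(x[i*QK8_0 + j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        y[i].d = d;
        // scale into [-127, 127] and round to the nearest int8
        for (int j = 0; j < QK8_0; j++) {
            y[i].qs[j] = (int8_t) roundf(x[i*QK8_0 + j] * id);
        }
    }
}
```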
That makes sense. So the Q8 mat mul makes the dot products faster, but with small matrices (like [384, 384] in bert MiniLM) the conversions add significant overhead. Thanks! I guess this issue can be closed if the vectorization effort is tracked somewhere else? In other news, I rewrote the bert.cpp API with mock support for batching and checked how the attention masking, etc. works in the reference Python implementation. So that should soon be ready for development of the batched ops :)
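On the masking side, the usual convention in the reference implementations is an additive mask: 0 for real token positions and a large negative value for padding, added to the attention scores before the softmax. A minimal sketch under that assumption (the function name is hypothetical, not part of bert.cpp):

```c
#include <math.h>

// Build an additive attention mask for a padded batch: one row per
// sequence, 0.0f for real token positions and -INFINITY for padding.
// mask must hold n_seqs * n_max floats.
static void build_attn_mask(const int * seq_lens, int n_seqs, int n_max, float * mask) {
    for (int s = 0; s < n_seqs; s++) {
        for (int t = 0; t < n_max; t++) {
            mask[s*n_max + t] = (t < seq_lens[s]) ? 0.0f : -INFINITY;
        }
    }
}
```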
Great - hopefully we add batched support soon!
abea4b7
With this commit, the runtimes for my benchmarks in bert.cpp went from 6s to 10s for Q4_0.
Meanwhile, Q4_1 got a lot faster (from 13s to 8s; pretty good, but still slower than Q4_0 used to be).
Attached are GGML_PERF outputs for single-prompt evaluations with Q4_0: slow with the offending commit and fast with the preceding commit.
The commit itself is quite a big one, so I can't really tell what is happening. Any ideas?
slow_abea4b7.txt
fast_32f22c0.txt
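For anyone reproducing these numbers: the per-op timings in the attached files come from ggml's GGML_PERF instrumentation, which is enabled at compile time. A sketch of the usage inside a ggml program of that era (the surrounding context, ctx0 and the output tensor, is assumed; the graph API names match ggml as of mid-2023):

```c
// Build with -DGGML_PERF so ggml records per-node timing counters,
// then print them after evaluating the graph.
struct ggml_cgraph gf = ggml_build_forward(output);   // output: final tensor
ggml_graph_compute(ctx0, &gf);                        // ctx0: ggml context
ggml_graph_print(&gf);                                // dumps per-op perf stats
```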