
bert.cpp q4_0 performance degradation from commit abea4b7 #122

Closed
skeskinen opened this issue Apr 30, 2023 · 7 comments

@skeskinen
Contributor

abea4b7

With this commit, the runtimes for my benchmarks in bert.cpp went from 6s to 10s for Q4_0.

Meanwhile, Q4_1 got a lot faster (from 13s to 8s, so pretty good, but still slower than Q4_0 used to be).

Attached are GGML_PERF outputs for a single prompt evaluation with Q4_0: slow with the offending commit and fast with the preceding commit.

The commit itself is quite a big one, so I can't really tell what is happening. Any ideas?

slow_abea4b7.txt
fast_32f22c0.txt

@ggerganov
Owner

What CPU are you using?

@skeskinen
Contributor Author

Ryzen 5 5600X on Linux

@ggerganov
Owner

Hm, strange - I don't see anything that can affect the AVX Q4_0 performance in that commit.
Btw, are you running the benchmarks with 6 threads?

@skeskinen
Contributor Author

I was running it with 8 threads.
Lowering to 6 threads, both Q4_0 and Q4_1 get significantly faster.
Q4_0: 10s -> 7.5s
Q4_1: 7.5s -> 6s

Is it expected that Q4_1 is faster now? Previously it was the other way around, so I kind of thought Q4_1 was the higher-quality, slower quantization.

@ggerganov
Owner

Your expectation is correct - Q4_1 should be slower than Q4_0 and should have higher quality

I think I know what might be causing the regression:
ggml quantizes intermediate F32 results to Q8 in order to perform integer-based matrix multiplications:

  • Q4_0 mat mul uses the Q8_0 quantization, which is not vectorized yet (see quantize_row_q8_0())
  • Q4_1 mat mul uses the Q8_1 quantization, which is vectorized (see quantize_row_q8_1()), so it is faster

For LLaMA, the mat mul computation is dominated by the dot products, so the Q8 quantization parts remain negligible and hence we postponed vectorizing Q8_0 as low priority. But I think for BERT the quantization becomes significant.
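
For context, the scalar path is roughly this (a simplified sketch, not the exact ggml source; block layout and names approximated):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// approximate block_q8_0 layout: one F32 scale + 32 signed bytes
typedef struct {
    float  d;           // block scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

static void quantize_row_q8_0_scalar(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        // absolute max of the block
        float amax = 0.0f;
        for (int l = 0; l < QK8_0; l++) {
            const float v = fabsf(x[i*QK8_0 + l]);
            if (v > amax) amax = v;
        }

        // scale the block into the signed 8-bit range [-127, 127]
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        for (int l = 0; l < QK8_0; l++) {
            y[i].qs[l] = (int8_t) roundf(x[i*QK8_0 + l]*id);
        }
    }
}
```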

In short, to improve Q4_0 performance we should vectorize the following function:

ggml/src/ggml.c, line 1436 (at 5f9f1c1):

static void quantize_row_q8_0(const float * restrict x, void * restrict vy, int k) {

Similar to quantize_row_q8_1()
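
Something along these lines should do the trick on AVX2 (an untested sketch that mirrors the structure of the Q8_1 path and reuses the block layout from the sketch above, so treat names and details as illustrative):

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;           // block scale (assumed layout, see the sketch above)
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

static void quantize_row_q8_0_avx2(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        // load the 32 floats of the block
        __m256 v0 = _mm256_loadu_ps(x + 0);
        __m256 v1 = _mm256_loadu_ps(x + 8);
        __m256 v2 = _mm256_loadu_ps(x + 16);
        __m256 v3 = _mm256_loadu_ps(x + 24);
        x += QK8_0;

        // amax = max(|x|) over the block: clear the sign bits, then reduce
        const __m256 sign_bit = _mm256_set1_ps(-0.0f);
        __m256 max_abs = _mm256_andnot_ps(sign_bit, v0);
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v1));
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v2));
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v3));

        __m128 max4 = _mm_max_ps(_mm256_extractf128_ps(max_abs, 1), _mm256_castps256_ps128(max_abs));
        max4 = _mm_max_ps(max4, _mm_movehl_ps(max4, max4));
        max4 = _mm_max_ss(max4, _mm_movehdup_ps(max4));
        const float amax = _mm_cvtss_f32(max4);

        // scale into the signed 8-bit range [-127, 127]
        const float d  = amax / 127.0f;
        const float id = amax != 0.0f ? 127.0f/amax : 0.0f;
        y[i].d = d;

        const __m256 mul = _mm256_set1_ps(id);
        v0 = _mm256_round_ps(_mm256_mul_ps(v0, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v1 = _mm256_round_ps(_mm256_mul_ps(v1, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v2 = _mm256_round_ps(_mm256_mul_ps(v2, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v3 = _mm256_round_ps(_mm256_mul_ps(v3, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

        // f32 -> i32 -> i16 -> i8
        __m256i i0 = _mm256_cvtps_epi32(v0);
        __m256i i1 = _mm256_cvtps_epi32(v1);
        __m256i i2 = _mm256_cvtps_epi32(v2);
        __m256i i3 = _mm256_cvtps_epi32(v3);

        i0 = _mm256_packs_epi32(i0, i1);
        i2 = _mm256_packs_epi32(i2, i3);
        i0 = _mm256_packs_epi16(i0, i2);

        // the packs work per 128-bit lane, so restore the original element order
        const __m256i perm = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
        i0 = _mm256_permutevar8x32_epi32(i0, perm);

        _mm256_storeu_si256((__m256i *) y[i].qs, i0);
    }
}
```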

@skeskinen
Contributor Author

That makes sense. So the Q8 mat mul makes the dot products faster, but with small matrices (like [384, 384] in bert MiniLM) the conversions add significant overhead.
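
A rough back-of-the-envelope way to see it (my reading of the shapes, so only the order of magnitude should be trusted): for an [n × k] weight times [k × m] activation mat mul, quantizing the F32 activations touches k·m values while the dot products do n·k·m multiply-adds, so

$$\frac{\text{quantization work}}{\text{dot product work}} \approx \frac{k \cdot m}{n \cdot k \cdot m} = \frac{1}{n}$$

and with n = 384 (MiniLM) instead of n ≥ 4096 (LLaMA), the conversion share of the runtime is roughly an order of magnitude larger.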

Thanks! I guess this issue can be closed if the vectorization effort is tracked somewhere else?

In other news, I rewrote the bert.cpp API with mock support for batching and checked how the attention masking, etc. works in the reference python implementation. So that should soon be ready for development of the batched OPs :)

@ggerganov
Owner

> In other news, I rewrote the bert.cpp API with mock support for batching and checked how the attention masking, etc. works in the reference python implementation. So that should soon be ready for development of the batched OPs :)

Great - hopefully we add batched support soon!
