
bert.cpp q4_0 performance degradation from commit abea4b7 #122

Closed
skeskinen opened this issue Apr 30, 2023 · 7 comments

@skeskinen
Contributor

abea4b7

With this commit, the runtimes for my benchmarks in bert.cpp went from 6s to 10s for Q4_0.

Meanwhile, Q4_1 got a lot faster (from 13s to 8s, so pretty good, but still slower than Q4_0 used to be).

Attached are GGML_PERF outputs for a single prompt evaluation with Q4_0: slow with the offending commit and fast with the preceding commit.

The commit itself is quite a big one, so I can't really tell what is happening. Any ideas?

slow_abea4b7.txt
fast_32f22c0.txt

@ggerganov
Owner

What CPU are you using?

@skeskinen
Contributor Author

Ryzen 5 5600X on Linux

@ggerganov
Owner

Hm, strange - I don't see anything that can affect the AVX Q4_0 performance in that commit.
Btw, are you running the benchmarks with 6 threads?

@skeskinen
Contributor Author

I was running it with 8 threads.
Lowering to 6 threads, both Q4_0 and Q4_1 get significantly faster.
Q4_0: 10s -> 7.5s
Q4_1: 7.5s -> 6s

Is it expected that Q4_1 is faster now? Previously it was the other way around, so I kind of thought Q4_1 was the higher-quality, slower quantization.

@ggerganov
Owner

Your expectation is correct - Q4_1 should be slower than Q4_0 and should have higher quality

I think I know what might be causing the regression:
ggml quantizes intermediate F32 results to Q8 in order to perform integer-based matrix multiplications:

  • Q4_0 mat mul uses the Q8_0 quantization, which is not vectorized yet (see quantize_row_q8_0())
  • Q4_1 mat mul uses the Q8_1 quantization, which is vectorized (see quantize_row_q8_1()), so it is faster

For LLaMA, the mat mul computation is dominated by the dot products, so the Q8 quantization parts remain negligible and hence we postponed vectorizing Q8_0 as low priority. But I think for BERT the quantization becomes significant.
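
For context, the scalar path is roughly this (a simplified sketch, not the exact ggml source; block layout and names approximated):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK8_0 32

// approximate block_q8_0 layout: one F32 scale + 32 signed bytes
typedef struct {
    float  d;           // block scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

static void quantize_row_q8_0_scalar(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        // absolute max of the block
        float amax = 0.0f;
        for (int l = 0; l < QK8_0; l++) {
            const float v = fabsf(x[i*QK8_0 + l]);
            if (v > amax) amax = v;
        }

        // scale the block into the signed 8-bit range [-127, 127]
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        for (int l = 0; l < QK8_0; l++) {
            y[i].qs[l] = (int8_t) roundf(x[i*QK8_0 + l]*id);
        }
    }
}
```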

In short, to improve Q4_0 performance we should vectorize the following function:

ggml/src/ggml.c, line 1436 (at 5f9f1c1):

static void quantize_row_q8_0(const float * restrict x, void * restrict vy, int k) {

Similar to quantize_row_q8_1()
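
Something along these lines should do the trick on AVX2 (an untested sketch that mirrors the structure of the Q8_1 path and reuses the block layout from the sketch above, so treat names and details as illustrative):

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;           // block scale (assumed layout, see the sketch above)
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

static void quantize_row_q8_0_avx2(const float * restrict x, block_q8_0 * restrict y, int k) {
    assert(k % QK8_0 == 0);
    const int nb = k / QK8_0;

    for (int i = 0; i < nb; i++) {
        // load the 32 floats of the block
        __m256 v0 = _mm256_loadu_ps(x + 0);
        __m256 v1 = _mm256_loadu_ps(x + 8);
        __m256 v2 = _mm256_loadu_ps(x + 16);
        __m256 v3 = _mm256_loadu_ps(x + 24);
        x += QK8_0;

        // amax = max(|x|) over the block: clear the sign bits, then reduce
        const __m256 sign_bit = _mm256_set1_ps(-0.0f);
        __m256 max_abs = _mm256_andnot_ps(sign_bit, v0);
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v1));
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v2));
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_bit, v3));

        __m128 max4 = _mm_max_ps(_mm256_extractf128_ps(max_abs, 1), _mm256_castps256_ps128(max_abs));
        max4 = _mm_max_ps(max4, _mm_movehl_ps(max4, max4));
        max4 = _mm_max_ss(max4, _mm_movehdup_ps(max4));
        const float amax = _mm_cvtss_f32(max4);

        // scale into the signed 8-bit range [-127, 127]
        const float d  = amax / 127.0f;
        const float id = amax != 0.0f ? 127.0f/amax : 0.0f;
        y[i].d = d;

        const __m256 mul = _mm256_set1_ps(id);
        v0 = _mm256_round_ps(_mm256_mul_ps(v0, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v1 = _mm256_round_ps(_mm256_mul_ps(v1, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v2 = _mm256_round_ps(_mm256_mul_ps(v2, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        v3 = _mm256_round_ps(_mm256_mul_ps(v3, mul), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

        // f32 -> i32 -> i16 -> i8
        __m256i i0 = _mm256_cvtps_epi32(v0);
        __m256i i1 = _mm256_cvtps_epi32(v1);
        __m256i i2 = _mm256_cvtps_epi32(v2);
        __m256i i3 = _mm256_cvtps_epi32(v3);

        i0 = _mm256_packs_epi32(i0, i1);
        i2 = _mm256_packs_epi32(i2, i3);
        i0 = _mm256_packs_epi16(i0, i2);

        // the packs work per 128-bit lane, so restore the original element order
        const __m256i perm = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
        i0 = _mm256_permutevar8x32_epi32(i0, perm);

        _mm256_storeu_si256((__m256i *) y[i].qs, i0);
    }
}
```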

@skeskinen
Contributor Author

That makes sense. So the Q8 mat mul makes the dot products faster, but with small matrices (like [384, 384] in bert MiniLM) the conversions add significant overhead.
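
A rough back-of-the-envelope way to see it (my reading of the shapes, so only the order of magnitude should be trusted): for an [n × k] weight times [k × m] activation mat mul, quantizing the F32 activations touches k·m values while the dot products do n·k·m multiply-adds, so

$$\frac{\text{quantization work}}{\text{dot product work}} \approx \frac{k \cdot m}{n \cdot k \cdot m} = \frac{1}{n}$$

and with n = 384 (MiniLM) instead of n ≥ 4096 (LLaMA), the conversion share of the runtime is roughly an order of magnitude larger.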

Thanks! I guess this issue can be closed if the vectorization effort is tracked somewhere else?

In other news, I rewrote the bert.cpp API with mock support for batching and checked how the attention masking, etc. works in the reference python implementation. So that should soon be ready for development of the batched OPs :)

@ggerganov
Owner

> In other news, I rewrote the bert.cpp API with mock support for batching and checked how the attention masking, etc. works in the reference python implementation. So that should soon be ready for development of the batched OPs :)

Great - hopefully we add batched support soon!
