Multiplying large matrices for batched BERT inference #412

Open
novoselrok opened this issue Jul 24, 2023 · 4 comments

@novoselrok

I'm trying to implement batched BERT inference based on the https://github.com/skeskinen/bert.cpp project, but I'm running into the following assert error:

ggml/src/ggml.c, line 10441 at commit 3dd91c6:

```c
GGML_ASSERT(ne02 == ne12);
```

I believe the issue is in the matrix multiplication with the following dimensions:

```cpp
ggml_mul_mat(ctx0, model.layers[il].q_w, Qcur)
// Qcur:                 768, 92, 30, 1 (emb. dim, tokens, batch size, 1)
// model.layers[il].q_w: 768, 768, 1, 1 (linear layer weight)
```
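
For reference, here is a minimal standalone sketch of just these shapes (the exact graph-build and graph-compute calls may differ between ggml versions) that hits the assert when ggml is built with OpenBLAS or Accelerate:

```c
// Minimal sketch, assuming the mid-2023 ggml C API: a 2D weight (ne02 == 1)
// multiplied with 3D batched activations (ne12 == 30). With a BLAS backend
// enabled, computing the graph hits GGML_ASSERT(ne02 == ne12).
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 512 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // linear layer weight: 768 x 768              -> ne02 = 1
    struct ggml_tensor * q_w  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 768);
    // activations: 768 x 92 tokens x 30 sequences -> ne12 = 30
    struct ggml_tensor * Qcur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 768, 92, 30);

    struct ggml_tensor * cur = ggml_mul_mat(ctx, q_w, Qcur);

    struct ggml_cgraph gf = ggml_build_forward(cur);
    // with OpenBLAS/Accelerate enabled this asserts in the BLAS branch of mul_mat
    ggml_graph_compute_with_ctx(ctx, &gf, 4);

    ggml_free(ctx);
    return 0;
}
```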

The location in the original code:

https://github.com/skeskinen/bert.cpp/blob/d9f04e609fb7f7e5fb3b20a77d4d685219971009/bert.cpp#L824-L827

Smaller inputs (batch sizes <= 4 with fewer tokens) run correctly and match the PyTorch embeddings.

Am I doing something wrong with my matrix dimensions, or is this missing functionality? If the latter, I'm happy to look into a fix with some help 🙂

@ggerganov
Owner

If you disable OpenBLAS, it will work.
On macOS, disable ACCELERATE.

@novoselrok
Author

Thanks, disabling Accelerate makes it work. I'm guessing disabling Accelerate/OpenBLAS also makes everything else slower?

@ggerganov
Owner

ggerganov commented Jul 25, 2023

On Apple Silicon, yes. On x64 it depends on the case, but if you use the maximum number of threads, performance should be similar or better, depending on whether quantization is used.

The BLAS branch in forward_mul_mat has to be updated to support broadcasting so that you don't need to disable Accelerate.
My advice is to wait for ggerganov/llama.cpp#2372 to be merged, and add the broadcast support after that. I will probably add it in that PR if I have the time.
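
For illustration, a hypothetical sketch (not ggml's actual code) of what that broadcasting amounts to: the single 2D weight is reused for every batch slice of the 3D activations, with one GEMM call per slice:

```c
// Hypothetical, self-contained illustration of broadcasting one 2D weight W
// over a batched 3D activation X, one cblas_sgemm call per batch slice.
// Link against OpenBLAS (-lopenblas); with Accelerate, include
// <Accelerate/Accelerate.h> instead of <cblas.h>.
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int d_model = 768;  // embedding dim
    const int n_tok   = 92;   // tokens per sequence
    const int n_batch = 30;   // batch size

    float * W = calloc((size_t) d_model * d_model, sizeof(float));           // [d_model, d_model]
    float * X = calloc((size_t) n_batch * n_tok * d_model, sizeof(float));   // [n_batch, n_tok, d_model]
    float * Y = calloc((size_t) n_batch * n_tok * d_model, sizeof(float));   // [n_batch, n_tok, d_model]

    for (int b = 0; b < n_batch; b++) {
        const float * x = X + (size_t) b * n_tok * d_model;
              float * y = Y + (size_t) b * n_tok * d_model;

        // y = x * W^T: the same linear layer (W broadcast) applied to slice b
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    n_tok, d_model, d_model,
                    1.0f, x, d_model,
                          W, d_model,
                    0.0f, y, d_model);
    }

    free(W); free(X); free(Y);
    return 0;
}
```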

@novoselrok
Copy link
Author

Ok, I'll keep an eye on that PR.

My preliminary tests show that inference at batch size 32 is 2-3x slower than PyTorch on CPU (on Apple Silicon). I'm probably leaving a lot of performance on the table, though, so I'll have to figure out whether matrix multiplication is the bottleneck.
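
One way to check that (a rough sketch assuming mid-2023 ggml APIs; ggml_graph_print and the GGML_PERF per-node counters may differ by version) is to time the whole graph compute and dump per-node timings:

```c
// Rough profiling sketch, assuming the mid-2023 ggml API: wall-clock the whole
// graph and, if ggml was built with GGML_PERF defined, dump per-node timings
// so the mul_mat nodes can be compared against everything else.
#include <stdio.h>
#include "ggml.h"

void profile_graph(struct ggml_context * ctx, struct ggml_cgraph * gf, int n_threads) {
    ggml_time_init();

    const int64_t t_start = ggml_time_us();
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);
    const int64_t t_end   = ggml_time_us();

    fprintf(stderr, "graph compute: %.2f ms\n", (t_end - t_start) / 1000.0);

    // with GGML_PERF this prints per-node run counts and timings
    ggml_graph_print(gf);
}
```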
