Multiplying large matrices for batched BERT inference #412

Open
novoselrok opened this issue Jul 24, 2023 · 4 comments

@novoselrok

I'm trying to implement batched BERT inference based on the https://github.com/skeskinen/bert.cpp project, but I'm running into the following assert error:

ggml/src/ggml.c, line 10441 at commit 3dd91c6:

```c
GGML_ASSERT(ne02 == ne12);
```

I believe the issue is in the matrix multiplication with the following dimensions:

```cpp
ggml_mul_mat(ctx0, model.layers[il].q_w, Qcur)
// Qcur:                 768, 92, 30, 1 (emb. dim, tokens, batch size, 1)
// model.layers[il].q_w: 768, 768, 1, 1 (linear layer weight)
```
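
For reference, here is a minimal standalone sketch of just these shapes (the exact graph-build and graph-compute calls may differ between ggml versions) that hits the assert when ggml is built with OpenBLAS or Accelerate:

```c
// Minimal sketch, assuming the mid-2023 ggml C API: a 2D weight (ne02 == 1)
// multiplied with 3D batched activations (ne12 == 30). With a BLAS backend
// enabled, computing the graph hits GGML_ASSERT(ne02 == ne12).
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 512 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // linear layer weight: 768 x 768              -> ne02 = 1
    struct ggml_tensor * q_w  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 768);
    // activations: 768 x 92 tokens x 30 sequences -> ne12 = 30
    struct ggml_tensor * Qcur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 768, 92, 30);

    struct ggml_tensor * cur = ggml_mul_mat(ctx, q_w, Qcur);

    struct ggml_cgraph gf = ggml_build_forward(cur);
    // with OpenBLAS/Accelerate enabled this asserts in the BLAS branch of mul_mat
    ggml_graph_compute_with_ctx(ctx, &gf, 4);

    ggml_free(ctx);
    return 0;
}
```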

The location in the original code:

https://github.com/skeskinen/bert.cpp/blob/d9f04e609fb7f7e5fb3b20a77d4d685219971009/bert.cpp#L824-L827

Smaller inputs (batch sizes <= 4 with fewer tokens) run correctly and match the PyTorch embeddings.

Am I doing something wrong with my matrix dimensions, or is this missing functionality? If the latter, I'm happy to look into a fix with some help 🙂

@ggerganov
Owner

If you disable OpenBLAS, it will work.
On macOS, disable ACCELERATE.

@novoselrok
Author

Thanks, disabling Accelerate makes it work. I'm guessing disabling Accelerate/OpenBLAS also makes everything else slower?

@ggerganov
Owner

ggerganov commented Jul 25, 2023

On Apple Silicon, yes. On x64 it depends on the case, but if you use the maximum number of threads, performance should be similar or better, depending on whether quantization is used.

The BLAS branch in forward_mul_mat has to be updated to support broadcasting so that you don't need to disable Accelerate.
My advice is to wait for ggerganov/llama.cpp#2372 to be merged, and add the broadcast support after that. I will probably add it in that PR if I have the time.
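
For illustration, a hypothetical sketch (not ggml's actual code) of what that broadcasting amounts to: the single 2D weight is reused for every batch slice of the 3D activations, with one GEMM call per slice:

```c
// Hypothetical, self-contained illustration of broadcasting one 2D weight W
// over a batched 3D activation X, one cblas_sgemm call per batch slice.
// Link against OpenBLAS (-lopenblas); with Accelerate, include
// <Accelerate/Accelerate.h> instead of <cblas.h>.
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int d_model = 768;  // embedding dim
    const int n_tok   = 92;   // tokens per sequence
    const int n_batch = 30;   // batch size

    float * W = calloc((size_t) d_model * d_model, sizeof(float));           // [d_model, d_model]
    float * X = calloc((size_t) n_batch * n_tok * d_model, sizeof(float));   // [n_batch, n_tok, d_model]
    float * Y = calloc((size_t) n_batch * n_tok * d_model, sizeof(float));   // [n_batch, n_tok, d_model]

    for (int b = 0; b < n_batch; b++) {
        const float * x = X + (size_t) b * n_tok * d_model;
              float * y = Y + (size_t) b * n_tok * d_model;

        // y = x * W^T: the same linear layer (W broadcast) applied to slice b
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    n_tok, d_model, d_model,
                    1.0f, x, d_model,
                          W, d_model,
                    0.0f, y, d_model);
    }

    free(W); free(X); free(Y);
    return 0;
}
```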

@novoselrok
Copy link
Author

Ok, I'll keep an eye on that PR.

My preliminary tests show that inference at batch size 32 is 2-3x slower than PyTorch on CPU (on Apple Silicon). I'm probably leaving a lot of performance on the table, though, so I'll have to figure out whether matrix multiplication is the bottleneck.
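
One way to check that (a rough sketch assuming mid-2023 ggml APIs; ggml_graph_print and the GGML_PERF per-node counters may differ by version) is to time the whole graph compute and dump per-node timings:

```c
// Rough profiling sketch, assuming the mid-2023 ggml API: wall-clock the whole
// graph and, if ggml was built with GGML_PERF defined, dump per-node timings
// so the mul_mat nodes can be compared against everything else.
#include <stdio.h>
#include "ggml.h"

void profile_graph(struct ggml_context * ctx, struct ggml_cgraph * gf, int n_threads) {
    ggml_time_init();

    const int64_t t_start = ggml_time_us();
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);
    const int64_t t_end   = ggml_time_us();

    fprintf(stderr, "graph compute: %.2f ms\n", (t_end - t_start) / 1000.0);

    // with GGML_PERF this prints per-node run counts and timings
    ggml_graph_print(gf);
}
```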
