Multiplying large matrices for batched BERT inference #412
I'm trying to implement batched BERT inference based on the https://github.com/skeskinen/bert.cpp project. I'm running into the following assert error:

ggml/src/ggml.c, line 10441 (at commit 3dd91c6)
I believe the issue is in the matrix multiplication with the following dimensions:
The location in the original code:
https://github.com/skeskinen/bert.cpp/blob/d9f04e609fb7f7e5fb3b20a77d4d685219971009/bert.cpp#L824-L827
Smaller cases (batch sizes <= 4, fewer tokens) run correctly and match the PyTorch embeddings.

Am I doing something wrong with my matrix dimensions, or is this missing functionality? If the latter, I'm happy to look into a fix with some help 🙂
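To make the shapes concrete, here is a stripped-down sketch of the kind of batched multiplication involved, in the style of bert.cpp's attention. The dimensions are illustrative stand-ins, not my exact failing ones, and it uses the same old-style graph API as bert.cpp:

```c
// Stripped-down sketch of a batched mul_mat in ggml. Dimensions are
// illustrative stand-ins; ggml orders them as [cols, rows, batch].
// Tensor data is left uninitialized, since only the shapes matter
// for exercising the mul_mat shape checks.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 128 * 1024 * 1024, // scratch for tensors + graph
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int d_head  = 64;       // per-head dimension
    const int n_token = 128;      // tokens per sequence
    const int n_bh    = 12 * 32;  // heads * batch, flattened into one dim

    // K and Q: [d_head, n_token, n_bh]. ggml_mul_mat requires ne[0] of the
    // two operands to match (and, I believe, the higher dims ne[2]/ne[3]
    // as well in this ggml version -- there is no broadcasting here).
    struct ggml_tensor * K = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_head, n_token, n_bh);
    struct ggml_tensor * Q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_head, n_token, n_bh);

    // KQ: [n_token, n_token, n_bh]. With BLAS enabled, sufficiently large
    // operands are routed to a fast path with stricter asserts; this is
    // the kind of call where the assert fires for me.
    struct ggml_tensor * KQ = ggml_mul_mat(ctx, K, Q);

    struct ggml_cgraph gf = ggml_build_forward(KQ);
    gf.n_threads = 4;
    ggml_graph_compute(ctx, &gf);

    ggml_free(ctx);
    return 0;
}
```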
Comments

If you disable OpenBLAS it will work.

Thanks, disabling Accelerate makes it work. I'm guessing disabling Accelerate/OpenBLAS also makes everything else slower?

On Apple Silicon, yes. On x64 it depends on the case, but if you use the maximum number of threads, performance should be similar or better, depending on whether quantization is used. The BLAS branch in […]

Ok, I'll keep an eye on that PR. My preliminary tests show that inference at batch size 32 is 2-3x slower than PyTorch on CPU (on Apple Silicon). I'm probably leaving a lot of performance on the table, though. I'll have to figure out whether matrix multiplication is the bottleneck.
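A build-configuration sketch of the workaround discussed above. The option names here are an assumption (they vary by project and version, so check the project's CMakeLists); what actually matters is that ggml.c ends up compiled without GGML_USE_ACCELERATE or GGML_USE_OPENBLAS defined, since those macros gate the BLAS fast path:

```sh
# Hypothetical CMake invocation: GGML_NO_ACCELERATE / GGML_OPENBLAS are
# common option names in ggml-based projects, but are not guaranteed --
# verify against your CMakeLists. The effect that matters is that ggml.c
# is built without GGML_USE_ACCELERATE or GGML_USE_OPENBLAS, so the
# stricter BLAS branch is never entered.
cmake -B build -DGGML_NO_ACCELERATE=ON -DGGML_OPENBLAS=OFF
cmake --build build --config Release
```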