ggml : add broadcast support for BLAS ggml_mul_mat() #460

ggerganov · 2023-08-19T07:08:49Z

For F16 and F32 CPU matrix multiplications BLAS will likely remain necessary, especially on Apple Silicon where we have Accelerate CBLAS using the AMX coprocessor. So we need to resolve the following TODOs:

ggml/src/ggml.c

Lines 10796 to 10799 in 08c57df

 // TODO: handle case when src0 is broadcast-able into src1 across 2nd,3rd dimension 

 // ref: https://github.com/ggerganov/ggml/pull/224 

 GGML_ASSERT(ne02 == ne12); 

 GGML_ASSERT(ne03 == ne13);

ggml/src/ggml.c

Lines 10782 to 10785 in 08c57df

 // TODO: handle case when src0 is broadcast-able into src1 across 2nd,3rd dimension 

 // ref: https://github.com/ggerganov/ggml/pull/224 

 GGML_ASSERT(ne02 == ne12); 

 GGML_ASSERT(ne03 == ne13);

This will fix SAM inference on Apple Silicon: #459

The broadcast logic is implemented in the same function for the non-BLAS branch:

ggml/src/ggml.c

Lines 10905 to 10909 in 08c57df

 
 // broadcast factors 

 const int64_t r2 = ne12/ne02; 

 const int64_t r3 = ne13/ne03;

Should be simple enough to adapt, but need to make sure it works correctly. Probably the best way to test it is to use the SAM inference example, which works without BLAS but asserts with BLAS.

…t BLAS mul_mat (#466) * Improve ADD_REL_POS perf in SAM by doing it inplace - Add unit tests for the ADD_REL_POS operation - I am not sure if this is valid implementation as we reuse the src0 memory in order to avoid copying it - When running SAM with the "Example output" command, image, point and 16 threads, this reduces the cumulative time of the ADD_REL_POS operation from 1000-1100 ms to 180-200ms - There is further room for optimization in the access patterns used in the implementation of the opration * Add non-inplace version for the GGML_OP_ADD_REL_POS * Fix map_unary warnings and refactor LayerNorm2d + remove ggml_cont in it * Fix Mac printf format warnings * sam : add ggml_graph_print() comment * ggml : add broadcast support for BLAS ggml_mul_mat() (#460) * Remove not needed build_forward_expand from add-rel-pos unit test --------- Co-authored-by: Georgi Gerganov <[email protected]>

ggerganov added enhancement New feature or request good first issue Good for newcomers labels Aug 19, 2023

ggerganov added a commit that referenced this issue Aug 21, 2023

ggml : add broadcast support for BLAS ggml_mul_mat() (#460)

d4b3b46

ggerganov mentioned this issue Aug 21, 2023

Improve ADD_REL_POS perf in SAM by doing it inplace #466

Merged

ggerganov closed this as completed in #466 Aug 21, 2023

YavorGIvanov assigned ggerganov Aug 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml : add broadcast support for BLAS ggml_mul_mat() #460

ggml : add broadcast support for BLAS ggml_mul_mat() #460

ggerganov commented Aug 19, 2023 •

edited

ggml : add broadcast support for BLAS ggml_mul_mat() #460

ggml : add broadcast support for BLAS ggml_mul_mat() #460

Comments

ggerganov commented Aug 19, 2023 • edited

ggerganov commented Aug 19, 2023 •

edited