ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_argsort, ggml_top_k #625
Conversation
It should work now, but I still need to test the performance.
whisper.cpp can go back to using a broadcasted ggml_add for dim 0 after this change. Will update it and check the performance later today.
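For context, a minimal sketch of what that looks like on the graph side (hypothetical shapes and helper name, assuming the post-change semantics where `ggml_add` repeats `src1` along any dimension in which the shapes divide evenly, including dim 0):

```c
#include "ggml.h"

// hypothetical helper: broadcast a [1, ne1] tensor across dim 0 of a
// [ne0, ne1] tensor with a single ggml_add, no ggml_repeat needed
struct ggml_tensor * add_broadcast(struct ggml_context * ctx,
                                   struct ggml_tensor  * x,  // [ne0, ne1]
                                   struct ggml_tensor  * b)  // [1,   ne1]
{
    // before this change, src1 had to be expanded to the shape of src0
    // (e.g. with ggml_repeat) when the shapes differed in dim 0
    return ggml_add(ctx, x, b);
}
```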
I will test this in stable diffusion with CUDA.
Works, but ... we will have to sacrifice some performance for the sake of maximizing compatibility.

Your implementation:

UNET Model ========== CUDA Timings =========
[ ADD] - 38.222000 ms - 419 - 0.091222 ms
[ MUL] - 13.909000 ms - 125 - 0.111272 ms
[ CONCAT] - 2.308000 ms - 12 - 0.192333 ms
[ NORM] - 2.567000 ms - 48 - 0.053479 ms
[GROUP_NORM] - 5.310000 ms - 61 - 0.087049 ms
[ MUL_MAT] - 121.505997 ms - 361 - 0.336582 ms
[ SCALE] - 2.243000 ms - 32 - 0.070094 ms
[ CONT] - 11.947000 ms - 160 - 0.074669 ms
[ SOFT_MAX] - 102.802002 ms - 32 - 3.212563 ms
[ IM2COL] - 66.832001 ms - 97 - 0.688990 ms
[ UPSCALE] - 0.577000 ms - 3 - 0.192333 ms
[ UNARY] - 4.678000 ms - 84 - 0.055690 ms
Total Time: 372.901001 ms

AutoEncoder ========== CUDA Timings =========
[ ADD] - 123.287003 ms - 85 - 1.450435 ms
[ MUL] - 48.032001 ms - 30 - 1.601067 ms
[GROUP_NORM] - 56.298000 ms - 30 - 1.876600 ms
[ MUL_MAT] - 167.600006 ms - 41 - 4.087805 ms
[ SCALE] - 0.856000 ms - 1 - 0.856000 ms
[ CONT] - 0.718000 ms - 3 - 0.239333 ms
[ SOFT_MAX] - 2.402000 ms - 1 - 2.402000 ms
[ IM2COL] - 951.328003 ms - 39 - 24.393026 ms
[ UPSCALE] - 4.278000 ms - 3 - 1.426000 ms
[ UNARY] - 22.635000 ms - 29 - 0.780517 ms
Total Time: 1377.433960 ms

My implementation:

UNET Model ========== CUDA Timings =========
[ ADD] - 21.768999 ms - 419 - 0.051955 ms
[ MUL] - 8.439000 ms - 125 - 0.067512 ms
[ CONCAT] - 2.272000 ms - 12 - 0.189333 ms
[ NORM] - 2.565000 ms - 48 - 0.053438 ms
[GROUP_NORM] - 5.298000 ms - 61 - 0.086852 ms
[ MUL_MAT] - 120.456001 ms - 361 - 0.333673 ms
[ SCALE] - 2.285000 ms - 32 - 0.071406 ms
[ CONT] - 11.619000 ms - 160 - 0.072619 ms
[ SOFT_MAX] - 101.762001 ms - 32 - 3.180063 ms
[ IM2COL] - 64.297997 ms - 97 - 0.662866 ms
[ UPSCALE] - 0.560000 ms - 3 - 0.186667 ms
[ UNARY] - 4.590000 ms - 84 - 0.054643 ms
Total Time: 345.912994 ms

AutoEncoder ========== CUDA Timings =========
[ ADD] - 65.053001 ms - 85 - 0.765329 ms
[ MUL] - 23.393000 ms - 30 - 0.779767 ms
[GROUP_NORM] - 56.570999 ms - 30 - 1.885700 ms
[ MUL_MAT] - 170.505997 ms - 41 - 4.158683 ms
[ SCALE] - 0.872000 ms - 1 - 0.872000 ms
[ CONT] - 0.731000 ms - 3 - 0.243667 ms
[ SOFT_MAX] - 2.396000 ms - 1 - 2.396000 ms
[ IM2COL] - 945.870972 ms - 39 - 24.253101 ms
[ UPSCALE] - 4.749000 ms - 3 - 1.583000 ms
[ UNARY] - 22.777000 ms - 29 - 0.785414 ms
Total Time: 1292.918945 ms
This should be faster. If it isn't, we can add different versions of the kernels when broadcasting is not required.
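For reference, a rough CPU-side illustration of the trade-off being discussed (hypothetical code, not the actual CUDA kernel): the fully general broadcast has to compute a modulo-indexed src1 offset per element, which a specialized same-shape kernel could skip.

```c
#include <stdint.h>

// hypothetical CPU illustration (not the actual CUDA kernel) of where the
// generic-broadcast overhead comes from: every dst element needs modulo
// arithmetic to locate the matching src1 element
static void add_bcast_f32(const float * src0, const float * src1, float * dst,
                          const int64_t ne[4],    // dst/src0 shape
                          const int64_t ne1[4]) { // src1 shape, ne1[i] divides ne[i]
    for (int64_t i3 = 0; i3 < ne[3]; i3++)
    for (int64_t i2 = 0; i2 < ne[2]; i2++)
    for (int64_t i1 = 0; i1 < ne[1]; i1++)
    for (int64_t i0 = 0; i0 < ne[0]; i0++) {
        const int64_t id = ((i3*ne[2] + i2)*ne[1] + i1)*ne[0] + i0;
        const int64_t is = ((i3 % ne1[3])*ne1[2] + (i2 % ne1[2]))*ne1[1]*ne1[0]
                         + (i1 % ne1[1])*ne1[0]
                         + (i0 % ne1[0]);
        // a same-shape fast path can skip the modulos entirely (is == id)
        dst[id] = src0[id] + src1[is];
    }
}
```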
CUDA with the latest changes: no improved performance. I have already run the same image generation several times, and the times remain the same.

UNET Model ========== CUDA Timings =========
[ ADD] - 40.991001 ms - 419 - 0.097831 ms
[ MUL] - 14.959000 ms - 125 - 0.119672 ms
[ CONCAT] - 2.246000 ms - 12 - 0.187167 ms
[ NORM] - 2.598000 ms - 48 - 0.054125 ms
[GROUP_NORM] - 5.277000 ms - 61 - 0.086508 ms
[ MUL_MAT] - 120.263000 ms - 361 - 0.333138 ms
[ SCALE] - 2.360000 ms - 32 - 0.073750 ms
[ CONT] - 11.507000 ms - 160 - 0.071919 ms
[ SOFT_MAX] - 102.699997 ms - 32 - 3.209375 ms
[ IM2COL] - 64.056999 ms - 97 - 0.660381 ms
[ UPSCALE] - 0.550000 ms - 3 - 0.183333 ms
[ UNARY] - 4.660000 ms - 84 - 0.055476 ms
Total Time: 372.167999 ms

AutoEncoder ========== CUDA Timings =========
[ ADD] - 75.974998 ms - 85 - 0.893824 ms
[ MUL] - 28.681999 ms - 30 - 0.956067 ms
[GROUP_NORM] - 55.887001 ms - 30 - 1.862900 ms
[ MUL_MAT] - 166.731995 ms - 41 - 4.066634 ms
[ SCALE] - 0.884000 ms - 1 - 0.884000 ms
[ CONT] - 0.730000 ms - 3 - 0.243333 ms
[ SOFT_MAX] - 2.397000 ms - 1 - 2.397000 ms
[ IM2COL] - 948.924011 ms - 39 - 24.331385 ms
[ UPSCALE] - 4.275000 ms - 3 - 1.425000 ms
[ UNARY] - 22.627001 ms - 29 - 0.780241 ms
Total Time: 1307.113037 ms
Ok, thanks for testing. I am adding a performance option to test-backend-ops.
Updated the …
Any of the new ops that I can help with implementing for Metal?
We need to implement these ops in Metal: ggml_mul_mat_id, ggml_argsort and ggml_top_k.
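In case it helps whoever picks these up, here is a small CPU reference of the intended argsort semantics (a sketch under the assumption that ggml_argsort orders indices along dim 0 row by row, with ggml_top_k being a descending argsort followed by a view of the first k indices):

```c
#include <stdint.h>
#include <stdlib.h>

static const float * g_row; // comparison context (qsort has no user arg)

static int cmp_desc(const void * pa, const void * pb) {
    const int32_t a = *(const int32_t *) pa;
    const int32_t b = *(const int32_t *) pb;
    // descending: larger values sort first; top-k then reads the first
    // k indices of each row
    return (g_row[a] < g_row[b]) - (g_row[a] > g_row[b]);
}

// argsort each row (dim 0) of an [ne0, nrows] f32 tensor into i32 indices
static void argsort_rows_desc(const float * src, int32_t * dst,
                              int64_t ne0, int64_t nrows) {
    for (int64_t r = 0; r < nrows; r++) {
        for (int64_t i = 0; i < ne0; i++) {
            dst[r*ne0 + i] = (int32_t) i;
        }
        g_row = src + r*ne0;
        qsort(dst + r*ne0, (size_t) ne0, sizeof(int32_t), cmp_desc);
    }
}
```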
@ggerganov I am adding tests for quants and I am seeing some unexpected errors, even with just the CPU backend.
Looks like the Metal backend fails every quant test, due to too much error. This doesn't happen with CUDA.
Yeah, the Metal matrix multiplications have NMSE of ~0.000015 (excluding those that are not supported -- should assert for them). Not sure why …
Does your Ubuntu machine support AVX512? My flags are AVX, AVX2, F16C, FMA.
Building without CUDA solves the issue; I didn't realize that the matrices I am testing with are big enough to enable automatic offloading. So this is a CUDA issue.
Yes, same flags + sse3:

[ 1%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml.c.o -MF CMakeFiles/ggml.dir/ggml.c.o.d -o CMakeFiles/ggml.dir/ggml.c.o -c /home/ggerganov/development/github/ggml/src/ggml.c
[ 2%] Building C object src/CMakeFiles/ggml.dir/ggml-alloc.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o CMakeFiles/ggml.dir/ggml-alloc.c.o -c /home/ggerganov/development/github/ggml/src/ggml-alloc.c
[ 3%] Building C object src/CMakeFiles/ggml.dir/ggml-backend.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-backend.c.o -MF CMakeFiles/ggml.dir/ggml-backend.c.o.d -o CMakeFiles/ggml.dir/ggml-backend.c.o -c /home/ggerganov/development/github/ggml/src/ggml-backend.c
[ 4%] Building C object src/CMakeFiles/ggml.dir/ggml-quants.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-quants.c.o -MF CMakeFiles/ggml.dir/ggml-quants.c.o.d -o CMakeFiles/ggml.dir/ggml-quants.c.o -c /home/ggerganov/development/github/ggml/src/ggml-quants.c
[ 5%] Linking C shared library libggml.so
cd /home/ggerganov/development/github/ggml/build/src && /home/ggerganov/.local/lib/python3.10/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/ggml.dir/link.txt --verbose=1
/usr/bin/cc -fPIC -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -shared -Wl,-soname,libggml.so -o libggml.so CMakeFiles/ggml.dir/ggml.c.o "CMakeFiles/ggml.dir/ggml-alloc.c.o" "CMakeFiles/ggml.dir/ggml-backend.c.o" "CMakeFiles/ggml.dir/ggml-quants.c.o" -Wl,-rpath,:::::::::::::: -lm -ldl
make[2]: Leaving directory '/home/ggerganov/development/github/ggml/build'
@slaren I'm not exactly sure what ggml_mul_mat_id is for; it might be the indirect GEMM operation. If so, according to this paper, it seems like it could make it possible to optimize convolution algorithms in both performance and memory (avoiding the memory overhead of im2col), although I could be mistaken.
mul_mat_id is an indirect GEMM, but at the matrix level. What it allows is choosing a matrix at run time based on the result of some operation. The paper seems to require indirection at the row level, so I don't think we could use mul_mat_id to implement it. They mention that im2col is often used to take advantage of highly optimized BLAS libraries, but their algorithm doesn't seem compatible with a BLAS GEMM.
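Roughly, in plain C (hypothetical names, not the ggml API), the distinction looks like this:

```c
#include <stdint.h>

typedef struct {
    const float * data; // row-major [rows x cols]
    int64_t rows, cols;
} mat;

// hypothetical reference: for each input column t, the id tensor selects
// which whole matrix from `as` is multiplied -- the indirection never
// reaches into individual rows, which is what the paper would require
static void mul_mat_id_ref(const mat * as, const int32_t * ids,
                           const float * x, float * y,
                           int64_t n_cols, int64_t k, int64_t n_rows) {
    for (int64_t t = 0; t < n_cols; t++) {
        const mat * a = &as[ids[t]]; // run-time matrix selection
        for (int64_t r = 0; r < n_rows; r++) {
            float sum = 0.0f;
            for (int64_t c = 0; c < k; c++) {
                sum += a->data[r*k + c] * x[t*k + c];
            }
            y[t*n_rows + r] = sum;
        }
    }
}
```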
After fixing the sizes of the matrices, the error with Metal is lower than with CUDA in most cases, even with MMQ. Without MMQ it is probably because we do the mat mul in F16, and with MMQ it may be because CUDA quantizes src1 to q8_1 instead of q8_0. I guess we should just increase the NMSE threshold for mat mul.
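For reference, NMSE here is assumed to be the squared error normalized by the energy of the reference output; a minimal sketch of the metric the threshold would apply to:

```c
#include <stddef.h>

// normalized mean squared error: squared error divided by the energy of
// the reference output (assumed definition; the threshold applies to this)
static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, ref2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d*d;
        ref2 += (double) ref[i] * (double) ref[i];
    }
    return err/ref2; // e.g. pass if nmse < 1e-6
}
```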
I have changed the mul_mat_id implementation in ggml-cuda to force a synchronization to copy the id to the CPU and use the standard mul_mat implementation.
Yes, some CUDA refactoring is needed soon. I'm also planning a refactoring of the Metal backend. What do you think should be the next steps? Should we start integrating ggml-backend into llama.cpp? Btw, the …
I think the main blockers left to be able to use ggml-backend in llama.cpp are OpenCL support and CUDA split tensors. It shouldn't take too long to add these, but I don't think it is quite ready yet.
I think this can be merged already if it is blocking #621 or anything else. I still want to add performance testing to test-backend-ops.
ci/run.sh
# TODO: disabled until we fix test-backend-ops
#set -e
Since the tests are passing now, maybe this can be enabled again.
Ref: #621
The CPU implementation should be done; I will add CUDA and Metal next.