
ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_argsort, ggml_top_k #625

Merged
slaren merged 45 commits into master from mul-add-bcast on Dec 5, 2023

Conversation

slaren
Collaborator

@slaren slaren commented Nov 30, 2023

Ref: #621

The CPU implementation should be done; I will add CUDA and Metal next.

@slaren slaren marked this pull request as ready for review December 1, 2023 00:10
@slaren
Collaborator Author

slaren commented Dec 1, 2023

It should work now, but I still need to test the performance.

Owner

@ggerganov ggerganov left a comment


whisper.cpp can go back to using a broadcasted ggml_add for dim 0 after this change. I will update it and check the performance later today.
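
For reference, a minimal sketch (not from this PR; tensor shapes and names are made up) of what the broadcasted add looks like from the API side, assuming the new rule that src1 is repeated along any dimension where it is smaller than src0 and divides it:

// Minimal sketch: dim-0 broadcast in ggml_add (illustrative shapes, not whisper.cpp code)
#include "ggml.h"

static struct ggml_tensor * add_broadcast_example(struct ggml_context * ctx) {
    // "activations": 4 rows of 64 elements
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 4);
    // "bias": a single value per row, broadcast along dim 0
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1, 4);
    // Before this change, b would have needed a ggml_repeat to match x's shape;
    // with full broadcasting, ggml_add can consume the smaller tensor directly.
    return ggml_add(ctx, x, b);
}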

@FSSRepo
Collaborator

FSSRepo commented Dec 1, 2023

I will test this in stable diffusion with CUDA.

@FSSRepo
Collaborator

FSSRepo commented Dec 1, 2023

It works, but we will have to sacrifice some performance for the sake of maximizing compatibility.

Your implementation

UNET Model

========== CUDA Timings =========
[       ADD] - 38.222000 ms - 419 - 0.091222 ms
[       MUL] - 13.909000 ms - 125 - 0.111272 ms
[    CONCAT] - 2.308000 ms - 12 - 0.192333 ms
[      NORM] - 2.567000 ms - 48 - 0.053479 ms
[GROUP_NORM] - 5.310000 ms - 61 - 0.087049 ms
[   MUL_MAT] - 121.505997 ms - 361 - 0.336582 ms
[     SCALE] - 2.243000 ms - 32 - 0.070094 ms
[      CONT] - 11.947000 ms - 160 - 0.074669 ms
[  SOFT_MAX] - 102.802002 ms - 32 - 3.212563 ms
[    IM2COL] - 66.832001 ms - 97 - 0.688990 ms
[   UPSCALE] - 0.577000 ms - 3 - 0.192333 ms
[     UNARY] - 4.678000 ms - 84 - 0.055690 ms
Total Time: 372.901001 ms

AutoEncoder

========== CUDA Timings =========
[       ADD] - 123.287003 ms - 85 - 1.450435 ms
[       MUL] - 48.032001 ms - 30 - 1.601067 ms
[GROUP_NORM] - 56.298000 ms - 30 - 1.876600 ms
[   MUL_MAT] - 167.600006 ms - 41 - 4.087805 ms
[     SCALE] - 0.856000 ms - 1 - 0.856000 ms
[      CONT] - 0.718000 ms - 3 - 0.239333 ms
[  SOFT_MAX] - 2.402000 ms - 1 - 2.402000 ms
[    IM2COL] - 951.328003 ms - 39 - 24.393026 ms
[   UPSCALE] - 4.278000 ms - 3 - 1.426000 ms
[     UNARY] - 22.635000 ms - 29 - 0.780517 ms
Total Time: 1377.433960 ms

My implementation:

UNET model

[       ADD] - 21.768999 ms - 419 - 0.051955 ms
[       MUL] - 8.439000 ms - 125 - 0.067512 ms
[    CONCAT] - 2.272000 ms - 12 - 0.189333 ms
[      NORM] - 2.565000 ms - 48 - 0.053438 ms
[GROUP_NORM] - 5.298000 ms - 61 - 0.086852 ms
[   MUL_MAT] - 120.456001 ms - 361 - 0.333673 ms
[     SCALE] - 2.285000 ms - 32 - 0.071406 ms
[      CONT] - 11.619000 ms - 160 - 0.072619 ms
[  SOFT_MAX] - 101.762001 ms - 32 - 3.180063 ms
[    IM2COL] - 64.297997 ms - 97 - 0.662866 ms
[   UPSCALE] - 0.560000 ms - 3 - 0.186667 ms
[     UNARY] - 4.590000 ms - 84 - 0.054643 ms
Total Time: 345.912994 ms

AutoEncoder

========== CUDA Timings =========
[       ADD] - 65.053001 ms - 85 - 0.765329 ms
[       MUL] - 23.393000 ms - 30 - 0.779767 ms
[GROUP_NORM] - 56.570999 ms - 30 - 1.885700 ms
[   MUL_MAT] - 170.505997 ms - 41 - 4.158683 ms
[     SCALE] - 0.872000 ms - 1 - 0.872000 ms
[      CONT] - 0.731000 ms - 3 - 0.243667 ms
[  SOFT_MAX] - 2.396000 ms - 1 - 2.396000 ms
[    IM2COL] - 945.870972 ms - 39 - 24.253101 ms
[   UPSCALE] - 4.749000 ms - 3 - 1.583000 ms
[     UNARY] - 22.777000 ms - 29 - 0.785414 ms
Total Time: 1292.918945 ms

@slaren
Collaborator Author

slaren commented Dec 1, 2023

This should be faster. If it isn't, we can add different versions of the kernels when broadcasting is not required.
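
For context, a rough sketch of that fast-path idea (kernel names, shapes, and launch parameters here are hypothetical, not the kernels in this PR): when src1 has the same shape as src0, indexing is a plain linear offset, and the modulo-based index math is only needed on the broadcast path.

// Hypothetical illustration of contiguous vs broadcast add kernels (not the PR's code)
__global__ void add_f32_contiguous(const float * x, const float * y, float * dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = x[i] + y[i]; // no index math needed when shapes match
}

__global__ void add_f32_bcast_dim0(const float * x, const float * y, float * dst,
                                   int ne0, int ne1, int y_ne0) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ne0 * ne1) return;
    int i0 = i % ne0;
    int i1 = i / ne0;
    // src1 is repeated along dim 0 when y_ne0 < ne0
    dst[i] = x[i] + y[i1 * y_ne0 + i0 % y_ne0];
}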

@slaren slaren changed the title ggml : support broadcasting in dim 0 in add and mul ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_sort, ggml_top_k, Dec 2, 2023
@slaren slaren changed the title ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_sort, ggml_top_k, ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_sort, ggml_top_k Dec 2, 2023
@slaren slaren changed the title ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_sort, ggml_top_k ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_argsort, ggml_top_k Dec 2, 2023
@FSSRepo
Collaborator

FSSRepo commented Dec 3, 2023

CUDA with the latest changes shows no performance improvement; I have run the same image generation several times and the times remain the same.

========== CUDA Timings =========
[       ADD] - 40.991001 ms - 419 - 0.097831 ms
[       MUL] - 14.959000 ms - 125 - 0.119672 ms
[    CONCAT] - 2.246000 ms - 12 - 0.187167 ms
[      NORM] - 2.598000 ms - 48 - 0.054125 ms
[GROUP_NORM] - 5.277000 ms - 61 - 0.086508 ms
[   MUL_MAT] - 120.263000 ms - 361 - 0.333138 ms
[     SCALE] - 2.360000 ms - 32 - 0.073750 ms
[      CONT] - 11.507000 ms - 160 - 0.071919 ms
[  SOFT_MAX] - 102.699997 ms - 32 - 3.209375 ms
[    IM2COL] - 64.056999 ms - 97 - 0.660381 ms
[   UPSCALE] - 0.550000 ms - 3 - 0.183333 ms
[     UNARY] - 4.660000 ms - 84 - 0.055476 ms
Total Time: 372.167999 ms

AutoEncoder

========== CUDA Timings =========
[       ADD] - 75.974998 ms - 85 - 0.893824 ms
[       MUL] - 28.681999 ms - 30 - 0.956067 ms
[GROUP_NORM] - 55.887001 ms - 30 - 1.862900 ms
[   MUL_MAT] - 166.731995 ms - 41 - 4.066634 ms
[     SCALE] - 0.884000 ms - 1 - 0.884000 ms
[      CONT] - 0.730000 ms - 3 - 0.243333 ms
[  SOFT_MAX] - 2.397000 ms - 1 - 2.397000 ms
[    IM2COL] - 948.924011 ms - 39 - 24.331385 ms
[   UPSCALE] - 4.275000 ms - 3 - 1.425000 ms
[     UNARY] - 22.627001 ms - 29 - 0.780241 ms
Total Time: 1307.113037 ms

@slaren
Collaborator Author

slaren commented Dec 3, 2023

Ok, thanks for testing. I am adding a performance option to test-backend-ops that will make it easier to test the performance of the kernels.
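
Not the actual test-backend-ops code, just a minimal sketch of how an op can be timed on a backend, assuming the graph gf and the backend have already been set up (the real option may measure things differently):

// Sketch: average wall-clock time per graph compute on a given backend
#include "ggml.h"
#include "ggml-backend.h"

static double measure_op_ms(ggml_backend_t backend, struct ggml_cgraph * gf, int n_runs) {
    ggml_backend_graph_compute(backend, gf); // warm-up run
    const int64_t t0 = ggml_time_us();
    for (int i = 0; i < n_runs; i++) {
        ggml_backend_graph_compute(backend, gf);
    }
    const int64_t t1 = ggml_time_us();
    return (t1 - t0) / 1000.0 / n_runs; // average ms per run
}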

@ggerganov
Owner

Updated the whisper example to use the broadcasted ggml_add and I don't observe any performance difference with either Metal or CUDA compared to the pre-broadcast version we had. So I think it is good.

@ggerganov
Owner

Are there any of the new ops that I can help implement for Metal?

@slaren
Collaborator Author

slaren commented Dec 4, 2023

We need to implement these ops in Metal:

  • ggml_div
  • ggml_sum_rows
  • ggml_argsort
  • ggml_mul_mat_id

src/ggml-cuda.cu (review comment, outdated and resolved)
@slaren
Collaborator Author

slaren commented Dec 4, 2023

@ggerganov I am adding tests for quants and I am seeing some unexpected errors, even with just the CPU backend.

q5_0 and q5_1 generate some UB warnings:

/home/diego/code/ggml/src/ggml.c:17957:48: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
/home/diego/code/ggml/src/ggml.c:17957:62: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
  MUL_MAT(type_a=q5_0,type_b=f32,m=32,n=32,k=256,bs=[1,1],nr=[1,1]): OK
/home/diego/code/ggml/src/ggml.c:17987:48: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
/home/diego/code/ggml/src/ggml.c:17987:62: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
  MUL_MAT(type_a=q5_1,type_b=f32,m=32,n=32,k=256,bs=[1,1],nr=[1,1]): OK
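
For illustration only (this is not the actual fix in ggml.c): shifting a 32-bit value by 32 is undefined behaviour in C, so a shift amount that can reach the type width has to be avoided, e.g. by widening the operand or guarding the shift.

// Illustrative example of the UB and one way to avoid it (hypothetical helper)
#include <stdint.h>

static inline uint32_t extract_bit(uint32_t qh, int shift) {
    // UB when shift == 32:
    //   return (qh >> shift) & 1u;
    // Safe: widen to 64 bits so the maximum shift stays below the type width.
    return (uint32_t)(((uint64_t)qh >> shift) & 1u);
}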

q2_k and q3_k generate NaNs:

  Error: MUL_MAT: NaN at index 6175
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,1],nr=[1,1]): FAIL
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,1],nr=[2,1]): OK
    Error: MUL_MAT: NaN at index 26655
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[1,1]): FAIL
    Error: MUL_MAT: NaN at index 95263
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[2,1]): FAIL
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[1,2]): OK
    Error: MUL_MAT: NaN at index 35871
  MUL_MAT(type_a=q2_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[2,2]): FAIL
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,1],nr=[1,1]): OK
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,1],nr=[2,1]): OK
    Error: MUL_MAT: NaN at index 26655
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[1,1]): FAIL
    Error: MUL_MAT: NaN at index 59423
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[2,1]): FAIL
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[1,2]): OK
    Error: MUL_MAT: NaN at index 208927
  MUL_MAT(type_a=q3_K,type_b=f32,m=32,n=32,k=256,bs=[10,10],nr=[2,2]): FAIL

@slaren
Collaborator Author

slaren commented Dec 4, 2023

Looks like the q2_k and q3_k NaNs don't happen on the M3, so I guess this is an issue with the AVX implementation.

The Metal backend fails every quant test due to too much error. This doesn't happen with CUDA.

@ggerganov
Owner

The Q2_K and Q3_K NaNs do not occur on my Ubuntu machine with AVX.

Yeah, the Metal matrix multiplications have NMSE of ~0.000015 (excluding those that are not supported -- we should assert for those). Not sure why.

@slaren
Collaborator Author

slaren commented Dec 4, 2023

Does your Ubuntu machine support AVX512? My flags are AVX, AVX2, F16C, FMA.

@slaren
Collaborator Author

slaren commented Dec 4, 2023

Building without CUDA solves the issue; I didn't realize that the matrices I am testing with are big enough to enable automatic offloading. So this is a CUDA issue.

@ggerganov
Owner

Yes, same flags + sse3:

[  1%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml.c.o -MF CMakeFiles/ggml.dir/ggml.c.o.d -o CMakeFiles/ggml.dir/ggml.c.o -c /home/ggerganov/development/github/ggml/src/ggml.c
[  2%] Building C object src/CMakeFiles/ggml.dir/ggml-alloc.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-alloc.c.o -MF CMakeFiles/ggml.dir/ggml-alloc.c.o.d -o CMakeFiles/ggml.dir/ggml-alloc.c.o -c /home/ggerganov/development/github/ggml/src/ggml-alloc.c
[  3%] Building C object src/CMakeFiles/ggml.dir/ggml-backend.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-backend.c.o -MF CMakeFiles/ggml.dir/ggml-backend.c.o.d -o CMakeFiles/ggml.dir/ggml-backend.c.o -c /home/ggerganov/development/github/ggml/src/ggml-backend.c
[  4%] Building C object src/CMakeFiles/ggml.dir/ggml-quants.c.o
cd /home/ggerganov/development/github/ggml/build/src && /usr/bin/cc -DGGML_BUILD -DGGML_SHARED -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -Dggml_EXPORTS -I/home/ggerganov/development/github/ggml/src/. -I/home/ggerganov/development/github/ggml/src/../include -I/home/ggerganov/development/github/ggml/src/../include/ggml -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wpedantic -Wformat=2 -Wno-unused -Wstrict-prototypes -Werror=vla -Wunused -Wextra -Wcast-qual -Wdouble-promotion -Wshadow -Wno-unused-function -Wmissing-prototypes -MD -MT src/CMakeFiles/ggml.dir/ggml-quants.c.o -MF CMakeFiles/ggml.dir/ggml-quants.c.o.d -o CMakeFiles/ggml.dir/ggml-quants.c.o -c /home/ggerganov/development/github/ggml/src/ggml-quants.c
[  5%] Linking C shared library libggml.so
cd /home/ggerganov/development/github/ggml/build/src && /home/ggerganov/.local/lib/python3.10/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/ggml.dir/link.txt --verbose=1
/usr/bin/cc -fPIC  -mavx -mavx2 -mfma -mf16c -msse3 -O3 -DNDEBUG -shared -Wl,-soname,libggml.so -o libggml.so CMakeFiles/ggml.dir/ggml.c.o "CMakeFiles/ggml.dir/ggml-alloc.c.o" "CMakeFiles/ggml.dir/ggml-backend.c.o" "CMakeFiles/ggml.dir/ggml-quants.c.o"  -Wl,-rpath,:::::::::::::: -lm -ldl 
make[2]: Leaving directory '/home/ggerganov/development/github/ggml/build'

AMD Ryzen 9 5950X 16-Core Processor

@FSSRepo
Collaborator

FSSRepo commented Dec 4, 2023

@slaren I'm not exactly sure what ggml_mul_mat_id is for; it might be the indirect GEMM operation. If so, according to this paper, it could make it possible to optimize the convolution algorithms in terms of both performance and memory (avoiding the memory overhead of im2col), although I could be mistaken.

@slaren
Collaborator Author

slaren commented Dec 4, 2023

ggml_mul_mat_id is an indirect GEMM, but at the matrix level: it allows choosing a matrix at run time based on the result of some operation. This paper seems to require indirection at the row level, so I don't think we could use mul_mat_id to implement it. They mention that im2col is often used to take advantage of highly optimized BLAS libraries, but their algorithm doesn't seem compatible with a BLAS GEMM.
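
For anyone following along, a conceptual reference of matrix-level indirect GEMM (this is not the ggml API; the function name, layouts, and parameters are illustrative): an ids array computed at run time selects which candidate matrix multiplies each input vector, whereas row-level indirection would gather individual rows, which a plain BLAS GEMM cannot express.

// Conceptual reference implementation of matrix-level indirection (illustrative only)
static void mul_mat_id_ref(
        const float * const as[],  // candidate matrices, each m x k (row-major)
        const int   * ids,         // per-vector matrix index, length n
        const float * b,           // n input vectors of length k (row-major)
        float       * dst,         // n output vectors of length m (row-major)
        int n, int m, int k) {
    for (int j = 0; j < n; j++) {
        const float * a = as[ids[j]]; // the matrix chosen at run time for vector j
        for (int i = 0; i < m; i++) {
            float sum = 0.0f;
            for (int l = 0; l < k; l++) {
                sum += a[i*k + l] * b[j*k + l];
            }
            dst[j*m + i] = sum;
        }
    }
}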

@slaren
Collaborator Author

slaren commented Dec 4, 2023

After fixing the sizes of the matrices, the error with Metal is lower than with CUDA in most cases, even with MMQ. Without MMQ this is probably because we do the mat mul in F16, and with MMQ it may be because CUDA quantizes src1 to q8_1 instead of q8_0. I guess we should just increase the NMSE threshold for mat mul.
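
For reference, a sketch of the kind of normalized mean squared error check used to compare a backend's output against the CPU reference (the exact formula and threshold in test-backend-ops may differ):

// Sketch: NMSE between a reference output and a backend output
#include <stddef.h>

static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, ref2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double)out[i] - (double)ref[i];
        err  += d * d;
        ref2 += (double)ref[i] * (double)ref[i];
    }
    return ref2 > 0.0 ? err / ref2 : err;
}
// A MUL_MAT result would pass if nmse(ref, out, n) stays below the threshold,
// e.g. on the order of 1e-5 for quantized matrix multiplication.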

@slaren
Collaborator Author

slaren commented Dec 5, 2023

I have changed the mul mat id implementation in ggml-cuda to force a synchronization, copy the id to the CPU, and use the standard ggml_cuda_mul_mat. It is probably going to add a bit of latency compared to making the choice entirely on the GPU, but it is a much simpler implementation. I think we need to do some refactoring of ggml-cuda before we can replicate the entire behavior of ggml_cuda_mul_mat on the device side.
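
Roughly, the approach looks like this (not the actual ggml-cuda code; the helper names and the layout of the op's sources are hypothetical):

// Sketch: copy the id to the host, synchronize, then reuse the standard mul mat path
int32_t row_id = 0;
// ids_dev points to the ids tensor on the device, stream is the CUDA stream in use
CUDA_CHECK(cudaMemcpyAsync(&row_id, ids_dev, sizeof(row_id), cudaMemcpyDeviceToHost, stream));
CUDA_CHECK(cudaStreamSynchronize(stream)); // wait for the id before choosing the matrix

// pick the selected matrix on the host and fall back to the existing dispatch
const struct ggml_tensor * src0 = dst->src[2 + row_id]; // hypothetical source layout for the op
ggml_cuda_mul_mat(src0, dst->src[1], dst);              // reuse the standard mul mat path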

@ggerganov
Owner

Yes, some CUDA refactoring is needed soon. I'm also planning a refactoring of the Metal backend.

What do you think should be the next steps? Should we start integrating ggml_backend into llama.cpp or are there more things that need to be done first?

Btw, test-backend-ops is really awesome! I see you are planning to also add performance stats.

@slaren
Collaborator Author

slaren commented Dec 5, 2023

I think the main blockers left to be able to use ggml-backend in llama.cpp are OpenCL support and CUDA split tensors. It shouldn't take too long to add these, but I don't think it is quite ready yet.

@slaren
Collaborator Author

slaren commented Dec 5, 2023

I think this can be merged already if it is blocking #621 or anything else. I still want to add performance testing to test-backend-ops and add optimized kernels for add/mul/div when no broadcasting is required, but that can be done in another PR.

ci/run.sh (outdated), comment on lines 92 to 93:
# TODO: disabled until we fix test-backend-ops
#set -e
Collaborator Author

Since the tests are passing now, maybe this can be enabled again.

@slaren slaren merged commit 703825f into master Dec 5, 2023
4 checks passed
@slaren slaren deleted the mul-add-bcast branch December 5, 2023 12:56