
test-backend-ops : add performance eval mode + improve CUDA repeat and binary broadcast ops performance #636

Merged
slaren merged 18 commits into master from test-backend-perf on Dec 7, 2023

Conversation

slaren
Collaborator

@slaren slaren commented Dec 5, 2023

Use with test-backend-ops perf [-o op] [-b backend].

Repeats the ops a number of times and calculates the memory throughput.

  • The memory transfer size of the op is given by the sum of the sizes of the destination tensor and all the sources, but it can be overridden by implementing op_size in the op test class
  • Binary ops with broadcasting calculate the size as ggml_nbytes(dst) * 3 to account for broadcasting
  • Matrix multiplication tries to calculate the size by considering all the memory accesses required for a standard $O(N^3)$ matrix multiplication
  • None of this takes into account cache effects, so it is possible to get a throughput higher than the system memory bandwidth
  • The number of repetitions (runs) per op depends on the memory size of the op. It tries to repeat the op enough times to get a total memory transfer of at least 8 GB for CPU, or 32 GB for GPU backends, with a maximum of 8192 repetitions (a sketch of this calculation follows below).
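
For illustration, a minimal sketch (hypothetical names, not the actual test-backend-ops code) of how the run count and throughput can be derived from the per-run memory size:

    #include <algorithm>
    #include <cstdint>

    // mem_size_bytes: bytes moved per run (dst + all sources, or the op_size override)
    static int choose_runs(int64_t mem_size_bytes, bool is_cpu) {
        const int64_t target = (is_cpu ? 8LL : 32LL) * 1024*1024*1024;          // 8 GB for CPU, 32 GB for GPU backends
        const int64_t runs   = (target + mem_size_bytes - 1) / mem_size_bytes;  // enough runs to reach the target
        return (int) std::min<int64_t>(runs, 8192);                             // capped at 8192 repetitions
    }

    static double throughput_gb_per_s(int64_t mem_size_bytes, int runs, double total_time_us) {
        // total bytes moved divided by total wall time
        return (double) mem_size_bytes * runs / (total_time_us * 1e-6) / 1e9;
    }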

@slaren
Collaborator Author

slaren commented Dec 5, 2023

The performance that I see when broadcasting with add/mul seems pretty good, so I am not sure what needs to be optimized. @FSSRepo can you give me some test cases for sd? I.e. the types and dimensions of the parameters.

  ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]):                               8192 runs -    10.28 us/run -        1 kB/run -    0.17 GB/s
  ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]):                              8192 runs -     3.28 us/run -       18 kB/run -    5.45 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]):                             8192 runs -     4.05 us/run -      187 kB/run -   44.20 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]):                             8192 runs -     4.04 us/run -      375 kB/run -   88.56 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]):                             8192 runs -     4.90 us/run -      375 kB/run -   72.95 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]):                             8192 runs -     4.77 us/run -      375 kB/run -   74.90 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]):                             8192 runs -     4.52 us/run -      375 kB/run -   79.11 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]):                             8192 runs -     5.84 us/run -      750 kB/run -  122.41 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]):                             8192 runs -     8.34 us/run -     1500 kB/run -  171.61 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]):                             8192 runs -     8.31 us/run -     3000 kB/run -  344.47 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[1,1,1,1]):                            171 runs -   223.71 us/run -   196608 kB/run -  838.15 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,1,1,1]):                             86 runs -   371.72 us/run -   393216 kB/run - 1008.82 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,1,1]):                             43 runs -   738.05 us/run -   786432 kB/run - 1016.20 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,2,1]):                             22 runs -  1472.18 us/run -  1572864 kB/run - 1018.90 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,2,2]):                             11 runs -  2938.55 us/run -  3145728 kB/run - 1020.91 GB/s

@FSSRepo
Collaborator

FSSRepo commented Dec 5, 2023

At the moment, I'm not at home. Could you wait for a few hours?

@slaren
Collaborator Author

slaren commented Dec 5, 2023

Whenever you can, there is no rush at all.

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

Some dimensions used by stable diffusion:

add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 72 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 41 us
add: A[16, 16, 2560, 1] B[1, 1, 2560, 1] - 42 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 46 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 14 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 30 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 85 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 44 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 75 us
add: A[16, 16, 1920, 1] B[1, 1, 1920, 1] - 37 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 9 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 23 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 32 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 40 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 88 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 80 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 26 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[32, 32, 1280, 1] B[1, 1, 1280, 1] - 73 us
add: A[32, 32, 1920, 1] B[1, 1, 1920, 1] - 417 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 42 us
add: A[640, 1, 1, 1] B[640, 1, 1, 1] - 9 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 55 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 59 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 353 us

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

@slaren I understand that ne= is the number of elements, but what is nr=?

The dimensions I provided are for the tensors a and b in the ggml_add function, along with their times in microseconds, using CUDA.

@slaren
Collaborator Author

slaren commented Dec 6, 2023

It's the number of repetitions (the per-dimension broadcast factor), so the dimensions of the tensor a are ne * nr and the dimensions of b are ne.

    ggml_tensor * a = ggml_new_tensor_4d(ctx, type, ne[0]*nr[0], ne[1]*nr[1], ne[2]*nr[2], ne[3]*nr[3]);
    ggml_tensor * b = ggml_new_tensor(ctx, type, 4, ne.data());
    ggml_tensor * out = op(ctx, a, b);

Where op is any binary op that supports broadcasting, like ggml_add or ggml_mul.
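
For example, the first of the cases above, A[1280, 256, 1, 1] with B[1280, 1, 1, 1], corresponds to ne=[1280,1,1,1] and nr=[1,256,1,1]. With the add_test_bin_bcast helper used in test-backend-ops.cpp (see the diff further down), that case would be written as:

    add_test_bin_bcast(GGML_TYPE_F32, {1280, 1, 1, 1}, {1, 256, 1, 1});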

@slaren
Collaborator Author

slaren commented Dec 6, 2023

I get this with these test cases:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -    10.27 us/run -       15 kB/run -    1.39 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     5.95 us/run -     3840 kB/run -  615.30 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     5.97 us/run -     3840 kB/run -  613.33 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     5.93 us/run -     3840 kB/run -  617.64 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -    16.64 us/run -     3840 kB/run -  220.05 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -    16.67 us/run -     3840 kB/run -  219.72 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -    23.62 us/run -     5760 kB/run -  232.55 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -    30.58 us/run -     7680 kB/run -  239.52 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    42.24 us/run -    15360 kB/run -  346.77 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    62.31 us/run -    23040 kB/run -  352.63 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -    16.64 us/run -     7680 kB/run -  440.18 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    15.57 us/run -    15360 kB/run -  940.96 GB/s
  ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]):                     8192 runs -     3.15 us/run -        7 kB/run -    2.27 GB/s

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

Seems good

@slaren
Collaborator Author

slaren commented Dec 6, 2023

Adjusting the block dims depending on the dimensions of the tensors improves some results:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -   122.68 us/run -       15 kB/run -    0.12 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     6.36 us/run -     3840 kB/run -  575.77 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     6.31 us/run -     3840 kB/run -  580.16 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     6.34 us/run -     3840 kB/run -  577.96 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -     5.96 us/run -     3840 kB/run -  614.85 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -     5.87 us/run -     3840 kB/run -  623.86 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -     7.20 us/run -     5760 kB/run -  763.11 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -     8.67 us/run -     7680 kB/run -  845.24 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    16.32 us/run -    15360 kB/run -  897.71 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    22.88 us/run -    23040 kB/run -  960.41 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -     8.69 us/run -     7680 kB/run -  843.31 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    16.23 us/run -    15360 kB/run -  902.41 GB/s
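
For reference, a rough sketch of the block-dims adjustment described above (hypothetical code, not the actual ggml-cuda change, assuming all dims are >= 1): pick the block dimensions from the tensor shape so that small dimensions do not waste threads, while keeping the block at or below 256 threads.

    #include <algorithm>
    #include <cuda_runtime.h>

    static dim3 pick_block_dims(int64_t ne0, int64_t ne1, int64_t ne2) {
        const unsigned bx = (unsigned) std::min<int64_t>(ne0, 128);
        const unsigned by = (unsigned) std::min<int64_t>(ne1, std::max<int64_t>(1, 256 / bx));
        const unsigned bz = (unsigned) std::min<int64_t>(ne2, std::max<int64_t>(1, 256 / (bx * by)));
        return dim3(bx, by, bz); // bx*by*bz <= 256
    }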

@slaren
Collaborator Author

slaren commented Dec 6, 2023

Processing two elements per thread:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -    10.76 us/run -       15 kB/run -    1.33 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     5.33 us/run -     3840 kB/run -  687.38 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     5.25 us/run -     3840 kB/run -  697.43 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     5.32 us/run -     3840 kB/run -  688.97 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -     5.21 us/run -     3840 kB/run -  703.33 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -     4.89 us/run -     3840 kB/run -  749.53 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -     5.68 us/run -     5760 kB/run -  966.92 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -     6.53 us/run -     7680 kB/run - 1121.87 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    15.27 us/run -    15360 kB/run -  959.04 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    21.18 us/run -    23040 kB/run - 1037.30 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -     6.51 us/run -     7680 kB/run - 1124.47 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    15.09 us/run -    15360 kB/run -  970.82 GB/s

This should be good enough already for these ops; they are usually a very small fraction of the overall time anyway.
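
For reference, a minimal sketch of the "two elements per thread" idea for a contiguous f32 add (hypothetical kernel, ignoring broadcasting): each thread loads and stores two consecutive elements, so half as many blocks are needed for the same tensor.

    __global__ void add_f32_x2(const float * a, const float * b, float * dst, int64_t n) {
        const int64_t i = 2 * (int64_t) (blockIdx.x * blockDim.x + threadIdx.x);
        if (i + 1 < n) {
            dst[i + 0] = a[i + 0] + b[i + 0];
            dst[i + 1] = a[i + 1] + b[i + 1];
        } else if (i < n) {
            dst[i] = a[i] + b[i]; // odd trailing element
        }
    }

    // launch with enough threads to cover (n + 1) / 2 element pairs, e.g.:
    //   const int     block = 256;
    //   const int64_t pairs = (n + 1) / 2;
    //   add_f32_x2<<<(pairs + block - 1) / block, block>>>(a, b, dst, n);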

@slaren slaren marked this pull request as ready for review December 6, 2023 15:25
@slaren slaren changed the title from "test-backend-ops : add performance eval mode" to "test-backend-ops : add performance eval mode + improve CUDA repeat and binary broadcast ops performance" on Dec 6, 2023
@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

@slaren I will test your performance improvement in stable-diffusion

@ggerganov
Owner

ggerganov commented Dec 6, 2023

Hm, whisper seems broken with CUDA at the moment?


Edit: whisper tests that are currently failing:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 8540ebd..abd4de6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1152,6 +1152,10 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
     add_test_bin_bcast(GGML_TYPE_F32, {5120, 1, 1, 1}, {1, 256, 1, 1});
     add_test_bin_bcast(GGML_TYPE_F32, {640, 1, 1, 1}, {1, 1, 1, 1});
 
+    // whisper
+    add_test_bin_bcast(GGML_TYPE_F32, {1500, 512, 1, 1}, {1, 512, 1, 1});
+    add_test_bin_bcast(GGML_TYPE_F32, {3000, 512, 1, 1}, {1, 512, 1, 1});
+
     test_cases.emplace_back(new test_scale());
 
     for (float eps : {1e-6f, 1e-5f, 1e-3f, 1e-1f}) {

Edit2: whisper is now fixed, though these tests run OOM - not sure if expected

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

@slaren with these dimensions in CUDA, it crashes (using stable diffusion, computing a LoRA).

CUDA OP: ADD  A[1, 1, 320, 320] B[1, 1, 320, 320]

@slaren
Collaborator Author

slaren commented Dec 6, 2023

Both issues should be fixed now.

@slaren
Collaborator Author

slaren commented Dec 6, 2023

whisper is now fixed, though these tests run OOM - not sure if expected

I don't think so; the tests shouldn't be so big as to cause OOM issues. When does that happen?

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

CUDA OP: ADD [3, 3, 2560, 1280] [3, 3, 2560, 1280]
blocks num(1, 1, 78020)
block dim(1, 3, 42)

CUDA error 9 at C:\proyectos\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7238: invalid configuration argument
current device: 0

@ggerganov
Owner

I don't think so; the tests shouldn't be so big as to cause OOM issues. When does that happen?

With that patch adding the whisper tests, it fails like this:

make -j && ./bin/test-backend-ops -b CUDA0 -o ADD
Testing 2 backends

Backend 1/2 (CPU)
  Skipping
Backend 2/2 (CUDA0)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
  Backend name: CUDA
  ADD(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]): OK
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): OK
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]): OK

CUDA error 9 at /home/ggerganov/development/github/ggml/src/ggml-cuda.cu:6998: invalid configuration argument
current device: 0
GGML_ASSERT: /home/ggerganov/development/github/ggml/src/ggml-cuda.cu:6998: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

@slaren
Collaborator Author

slaren commented Dec 6, 2023

Is that nr correct though? That would cause 262144 rows. Still, the issue is that CUDA limits grid sizes in the y and z dimensions to 65535. So I guess we need to move the entire grid size to the x dimension and compute the indices from that, which is a bit annoying.
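
A rough sketch of that workaround (hypothetical, not the eventual fix): launch a 1D grid with one block per row and recover the per-dimension indices arithmetically, so no single grid dimension exceeds the limit (the x dimension allows up to 2^31 - 1 blocks).

    // contiguous f32 tensors of shape [ne0, ne1, ne2, ne3], no broadcasting;
    // launched as add_rows_1d<<<ne1*ne2*ne3, 256>>>(a, b, dst, ne0, ne1, ne2, ne3)
    __global__ void add_rows_1d(const float * a, const float * b, float * dst,
                                int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3) {
        const int64_t row = blockIdx.x; // all rows flattened into the x dimension
        if (row >= ne1*ne2*ne3) {
            return;
        }
        const int64_t i3   = row / (ne1*ne2);
        const int64_t i2   = (row / ne1) % ne2;
        const int64_t i1   = row % ne1;
        const int64_t base = ((i3*ne2 + i2)*ne1 + i1) * ne0;
        for (int64_t i0 = threadIdx.x; i0 < ne0; i0 += blockDim.x) {
            dst[base + i0] = a[base + i0] + b[base + i0];
        }
    }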

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

Is that nr correct though?

In this case (CUDA OP: ADD [3, 3, 2560, 1280] [3, 3, 2560, 1280]) nr should be [1, 1, 1, 1], i.e. without using broadcasting.

@ggerganov
Owner

Is that nr correct though?

No, it's not. I got confused about the meaning of the test arguments - ignore these tests.

Not sure about @FSSRepo's case

@FSSRepo
Collaborator

FSSRepo commented Dec 6, 2023

@slaren I think it could check whether the shapes are the same and, if so, reorder the dimensions so that the larger ones come first and the smaller ones last.

[3,3,2560,1280] -> [2560, 1280, 3, 3] in order to avoid the CUDA grid limits; afterwards, the result is returned to the original order.

@slaren
Collaborator Author

slaren commented Dec 7, 2023

I tried a few different solutions with a single kernel, but they resulted in decreased performance, so instead I added a fallback kernel for large tensors. I also added dimension collapsing so that multi-dimensional tensors are processed as 1D tensors when possible, which may improve performance slightly in some cases (see the sketch after the results below).

ADD(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]):          8192 runs -     3.37 us/run -        0 kB/run -    0.03 GB/s
ADD(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]):      8192 runs -     3.86 us/run -     1200 kB/run -  296.47 GB/s
ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]):        8192 runs -     3.42 us/run -        1 kB/run -    0.52 GB/s
ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]):       8192 runs -     3.43 us/run -       18 kB/run -    5.22 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]):      8192 runs -     3.46 us/run -      187 kB/run -   51.72 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]):      8192 runs -     3.68 us/run -      375 kB/run -   97.16 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]):      8192 runs -     3.69 us/run -      375 kB/run -   96.85 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]):      8192 runs -     3.69 us/run -      375 kB/run -   97.04 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]):      8192 runs -     3.61 us/run -      375 kB/run -   99.07 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]):      8192 runs -     3.77 us/run -      750 kB/run -  189.94 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]):      8192 runs -     4.00 us/run -     1500 kB/run -  357.74 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]):      8192 runs -     4.59 us/run -     3000 kB/run -  623.11 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):       8192 runs -     3.31 us/run -       15 kB/run -    4.32 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):     8192 runs -     4.91 us/run -     3840 kB/run -  745.62 GB/s
ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):     8192 runs -     4.88 us/run -     3840 kB/run -  750.94 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):     8192 runs -     4.91 us/run -     3840 kB/run -  746.06 GB/s
ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):     8192 runs -     4.86 us/run -     3840 kB/run -  752.84 GB/s
ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):     8192 runs -     4.87 us/run -     3840 kB/run -  751.22 GB/s
ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):     5826 runs -     5.68 us/run -     5760 kB/run -  966.45 GB/s
ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):     4370 runs -     6.50 us/run -     7680 kB/run - 1127.64 GB/s
ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):     2185 runs -    15.25 us/run -    15360 kB/run -  960.36 GB/s
ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):     1457 runs -    21.13 us/run -    23040 kB/run - 1039.86 GB/s
ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):      4370 runs -     6.51 us/run -     7680 kB/run - 1125.93 GB/s
ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):     2185 runs -    15.09 us/run -    15360 kB/run -  970.90 GB/s
ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]):        8192 runs -     3.32 us/run -        7 kB/run -    2.16 GB/s
ADD(type=f32,ne=[3,3,2560,1280],nr=[1,1,1,1]):      98 runs -   392.74 us/run -   345600 kB/run -  839.20 GB/s
ADD(type=f32,ne=[3,3,2560,1280],nr=[2,1,1,1]):      49 runs -   860.69 us/run -   691200 kB/run -  765.87 GB/s
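
For reference, a rough sketch of the dimension collapsing mentioned above (hypothetical code for contiguous tensors, not the actual ggml-cuda implementation): adjacent dimensions in which the source is broadcast in both or in neither can be merged, so the kernel sees fewer, larger dimensions.

    #include <array>
    #include <cstdint>

    // ne_dst/ne_src are the shapes of the destination and the (possibly broadcast) source;
    // returns the number of dimensions remaining after collapsing
    static int collapse_dims(std::array<int64_t, 4> & ne_dst, std::array<int64_t, 4> & ne_src) {
        int n = 4;
        for (int i = 0; i + 1 < n; ) {
            const bool full_i  = ne_src[i]     == ne_dst[i];     // source not broadcast in dim i
            const bool full_i1 = ne_src[i + 1] == ne_dst[i + 1]; // source not broadcast in dim i+1
            if (full_i == full_i1) {
                // merge dims i and i+1
                ne_dst[i] *= ne_dst[i + 1];
                ne_src[i] *= ne_src[i + 1];
                for (int j = i + 1; j + 1 < n; ++j) {
                    ne_dst[j] = ne_dst[j + 1];
                    ne_src[j] = ne_src[j + 1];
                }
                ne_dst[n - 1] = 1;
                ne_src[n - 1] = 1;
                n--;
            } else {
                i++;
            }
        }
        return n;
    }

For example, ADD(ne=[3,3,2560,1280], nr=[1,1,1,1]) collapses to a single dimension of 3*3*2560*1280 elements, while a case like ne=[1,1,1280,1], nr=[16,16,1,1] keeps the broadcast dimension separate from the non-broadcast one.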

@FSSRepo
Collaborator

FSSRepo commented Dec 7, 2023

@slaren

This implementation

[       ADD] - 22.503000 ms - 419 - 0.053706 ms
[       MUL] - 8.619000 ms - 125 - 0.068952 ms
[    CONCAT] - 2.235000 ms - 12 - 0.186250 ms
[      NORM] - 2.643000 ms - 48 - 0.055062 ms
[GROUP_NORM] - 5.309000 ms - 61 - 0.087033 ms
[   MUL_MAT] - 120.028000 ms - 361 - 0.332488 ms
[     SCALE] - 2.323000 ms - 32 - 0.072594 ms
[      CONT] - 11.394000 ms - 160 - 0.071213 ms
[  SOFT_MAX] - 101.511002 ms - 32 - 3.172219 ms
[    IM2COL] - 62.881001 ms - 97 - 0.648258 ms
[   UPSCALE] - 0.566000 ms - 3 - 0.188667 ms
[     UNARY] - 4.700000 ms - 84 - 0.055952 ms
Total Time: 344.712036 ms

My implementation

[       ADD] - 21.808001 ms - 419 - 0.052048 ms
[       MUL] - 8.414000 ms - 125 - 0.067312 ms
[    CONCAT] - 2.197000 ms - 12 - 0.183083 ms
[      NORM] - 2.564000 ms - 48 - 0.053417 ms
[GROUP_NORM] - 5.223000 ms - 61 - 0.085623 ms
[   MUL_MAT] - 117.841003 ms - 361 - 0.326429 ms
[     SCALE] - 2.251000 ms - 32 - 0.070344 ms
[      CONT] - 11.176000 ms - 160 - 0.069850 ms
[  SOFT_MAX] - 101.967003 ms - 32 - 3.186469 ms
[    IM2COL] - 62.348999 ms - 97 - 0.642773 ms
[   UPSCALE] - 0.544000 ms - 3 - 0.181333 ms
[     UNARY] - 4.892000 ms - 84 - 0.058238 ms
Total Time: 341.226013 ms

It seems that the performance is almost identical to my implementation, and it doesn't pose any issues. Everything is working well. Good job!

@ggerganov
Owner

@slaren Planning to make a sync with llama.cpp/whisper.cpp after we merge this PR and ggerganov/llama.cpp#4309. Any concerns?

@slaren
Collaborator Author

slaren commented Dec 7, 2023

No, I don't expect any significant issues.

@slaren slaren merged commit 990f931 into master Dec 7, 2023
4 checks passed
@slaren slaren deleted the test-backend-perf branch December 7, 2023 17:20