Added kronecker product support for dense matrices #2177
Conversation
Is anyone able to review this?
Thanks, looks like a good kernel implementation. I added a few comments.
It probably would have been nice to implement this in GPUArrays instead, so that other back-ends also get to benefit, but a CUDA.jl-only implementation is fine for now.
Did you benchmark the implementation? Computing the Kronecker product is very much memory bound, so it should be possible to get close to peak bandwidth, assuming the memory accesses are ordered correctly (maximizing coalescing, etc). I didn't pay close attention to that bit; it would probably be useful to run under NSight Compute for some insights.
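For readers unfamiliar with what such a kernel computes: below is a plain-Python illustration (not the PR's CUDA.jl code) of the index arithmetic a Kronecker product kernel evaluates, and why access ordering matters for coalescing.

```python
# Illustration only (not the PR's CUDA kernel): the index mapping a
# Kronecker product kernel evaluates. For A of size m×n and B of size p×q,
#   C[i*p + k][j*q + l] = A[i][j] * B[k][l]
# On a GPU, threads should be assigned so that consecutive threads write
# consecutive elements of C (the fastest-varying output index should map to
# the fastest-varying thread index) so global-memory accesses coalesce.

def kron(A, B):
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    C[i * p + k][j * q + l] = A[i][j] * B[k][l]
    return C

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
# kron(A, B) == [[0, 1, 0, 2],
#                [1, 0, 2, 0],
#                [0, 3, 0, 4],
#                [3, 0, 4, 0]]
```

Each (i, j) tile of the output is the scalar A[i][j] times all of B, which is what makes the computation purely memory bound: one multiply per element written.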
The benchmark runs about four times faster on my NVIDIA RTX 4090 than on my AMD Ryzen Threadripper PRO 3975WX CPU. I used the same matrices as in the tests:

```julia
julia> @benchmark kron!($C, $A, $B)
BenchmarkTools.Trial: 275 samples with 6 evaluations.
 Range (min … max):  5.480 μs … 3.549 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.520 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.030 ms ± 1.216 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 1.58 KiB, allocs estimate: 26.
```

I was hoping for better results, but perhaps that is because of the relatively small dimension of A (50 * 80 = 4000).
Yeah, those are pretty small inputs. Try
With

```
Profiler ran for 2.33 ms, capturing 6 events.

Host-side activity: calling CUDA APIs took 208.85 µs (8.95% of the trace)
┌──────────┬────────────┬───────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Name           │
├──────────┼────────────┼───────┼────────────────┤
│    1.64% │   38.15 µs │     1 │ cuLaunchKernel │
└──────────┴────────────┴───────┴────────────────┘

Device-side activity: GPU was busy for 2.07 ms (88.76% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                                                                 │
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   88.76% │    2.07 ms │     1 │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

With

```
Profiler ran for 1.0 s, capturing 3494 events.

Host-side activity: calling CUDA APIs took 998.05 ms (99.77% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name             │
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────┤
│   99.49% │   995.3 ms │   499 │   1.99 ms ± 0.05   (  1.96 ‥ 2.08)   │ cuCtxSynchronize │
│    0.21% │    2.05 ms │   499 │   4.11 µs ± 13.02  (  2.62 ‥ 224.59) │ cuLaunchKernel   │
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────┘

Device-side activity: GPU was busy for 994.16 ms (99.38% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                 │
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   99.38% │  994.16 ms │   499 │   1.99 ms ± 0.05 (  1.96 ‥ 2.07)   │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬───────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                 │ Name                │
├──────────┼────────────┼───────┼───────────────────────────────────┼─────────────────────┤
│   99.98% │      1.0 s │   499 │   2.0 ms ± 0.05 (  1.97 ‥ 2.3)    │ @bprofile.iteration │
└──────────┴────────────┴───────┴───────────────────────────────────┴─────────────────────┘
```
Ah interesting, so it is kernel bound. It would probably be good to run under NSight Compute then, to see if we aren't missing anything (or whether we're just properly memory bound here).
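As a quick sanity check before reaching for NSight Compute, one can compare the bytes the kernel must move against its measured device time; if the result is near the RTX 4090's roughly 1 TB/s peak, the kernel is already memory bound. A minimal sketch, where the 2 GB output size is a hypothetical figure (the profiled matrix sizes are not stated above):

```python
# Back-of-the-envelope check (not from the PR): a Kronecker kernel writing
# an (m*p)×(n*q) Float32 output moves at least 4*m*p*n*q bytes for the
# stores alone; reads of A and B are comparatively tiny and mostly cached.

def effective_bandwidth_gbs(n_bytes, seconds):
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / seconds / 1e9

# Hypothetical example: a 2 GB output written in 2.07 ms (the device time
# reported above) would correspond to roughly:
print(round(effective_bandwidth_gbs(2e9, 2.07e-3)))  # → 966 (GB/s)
```

A figure close to the card's peak would mean further kernel tuning can't help much; a much lower one would suggest uncoalesced accesses worth inspecting in NSight Compute.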
Sorry, I won't have the time to optimize this, so I guess we can merge this as it is.