Added kronecker product support for dense matrices #2177
Conversation
Is anyone able to review this?
Thanks, looks like a good kernel implementation. I added a few comments.
It probably would have been nice to implement this in GPUArrays instead, so that other back-ends also get to benefit, but a CUDA.jl-only implementation is fine for now.
Did you benchmark the implementation? Computing the Kronecker product is very much memory bound, so it should be possible to get close to peak bandwidth, assuming the memory accesses are ordered correctly (maximizing coalescing, etc). I didn't pay close attention to that bit; it would probably be useful to run under NSight Compute for some insights.
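For readers unfamiliar with what such a kernel computes: below is a plain-Python illustration (not the PR's CUDA.jl code) of the index arithmetic a Kronecker product kernel evaluates, and why access ordering matters for coalescing.

```python
# Illustration only (not the PR's CUDA kernel): the index mapping a
# Kronecker product kernel evaluates. For A of size m×n and B of size p×q,
#   C[i*p + k][j*q + l] = A[i][j] * B[k][l]
# On a GPU, threads should be assigned so that consecutive threads write
# consecutive elements of C (the fastest-varying output index should map to
# the fastest-varying thread index) so global-memory accesses coalesce.

def kron(A, B):
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    C[i * p + k][j * q + l] = A[i][j] * B[k][l]
    return C

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
# kron(A, B) == [[0, 1, 0, 2],
#                [1, 0, 2, 0],
#                [0, 3, 0, 4],
#                [3, 0, 4, 0]]
```

Each (i, j) tile of the output is the scalar A[i][j] times all of B, which is what makes the computation purely memory bound: one multiply per element written.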
The benchmark runs about four times faster on my NVIDIA RTX 4090 than on my AMD Ryzen Threadripper PRO 3975WX CPU. I used the same matrices as in the tests:

```julia
julia> @benchmark kron!($C, $A, $B)
BenchmarkTools.Trial: 275 samples with 6 evaluations.
 Range (min … max):  5.480 μs … 3.549 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.520 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.030 ms ± 1.216 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 1.58 KiB, allocs estimate: 26.
```

I was hoping for better results, but perhaps that is because of the relatively small dimension of A (50 * 80 = 4000).
Yeah, those are pretty small inputs. Try
With

```
Profiler ran for 2.33 ms, capturing 6 events.

Host-side activity: calling CUDA APIs took 208.85 µs (8.95% of the trace)
┌──────────┬────────────┬───────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Name           │
├──────────┼────────────┼───────┼────────────────┤
│    1.64% │   38.15 µs │     1 │ cuLaunchKernel │
└──────────┴────────────┴───────┴────────────────┘

Device-side activity: GPU was busy for 2.07 ms (88.76% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                                                                 │
├──────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   88.76% │    2.07 ms │     1 │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

With

```
Profiler ran for 1.0 s, capturing 3494 events.

Host-side activity: calling CUDA APIs took 998.05 ms (99.77% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬──────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name             │
├──────────┼────────────┼───────┼──────────────────────────────────────┼──────────────────┤
│   99.49% │   995.3 ms │   499 │   1.99 ms ± 0.05   (  1.96 ‥ 2.08)   │ cuCtxSynchronize │
│    0.21% │    2.05 ms │   499 │   4.11 µs ± 13.02  (  2.62 ‥ 224.59) │ cuLaunchKernel   │
└──────────┴────────────┴───────┴──────────────────────────────────────┴──────────────────┘

Device-side activity: GPU was busy for 994.16 ms (99.38% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                 │
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   99.38% │  994.16 ms │   499 │   1.99 ms ± 0.05 (  1.96 ‥ 2.07)   │ _Z18_kron_mat_kernelA_13CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S1_S1_ │
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬───────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                 │ Name                │
├──────────┼────────────┼───────┼───────────────────────────────────┼─────────────────────┤
│   99.98% │      1.0 s │   499 │   2.0 ms ± 0.05 (  1.97 ‥ 2.3)    │ @bprofile.iteration │
└──────────┴────────────┴───────┴───────────────────────────────────┴─────────────────────┘
```
Ah interesting, so it is kernel bound. It would probably be good to run under NSight Compute then, to see if we aren't missing anything (or whether we're just properly memory bound here).
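As a quick sanity check before reaching for NSight Compute, one can compare the bytes the kernel must move against its measured device time; if the result is near the RTX 4090's roughly 1 TB/s peak, the kernel is already memory bound. A minimal sketch, where the 2 GB output size is a hypothetical figure (the profiled matrix sizes are not stated above):

```python
# Back-of-the-envelope check (not from the PR): a Kronecker kernel writing
# an (m*p)×(n*q) Float32 output moves at least 4*m*p*n*q bytes for the
# stores alone; reads of A and B are comparatively tiny and mostly cached.

def effective_bandwidth_gbs(n_bytes, seconds):
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return n_bytes / seconds / 1e9

# Hypothetical example: a 2 GB output written in 2.07 ms (the device time
# reported above) would correspond to roughly:
print(round(effective_bandwidth_gbs(2e9, 2.07e-3)))  # → 966 (GB/s)
```

A figure close to the card's peak would mean further kernel tuning can't help much; a much lower one would suggest uncoalesced accesses worth inspecting in NSight Compute.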
Sorry, I won't have the time to optimize this, so I guess we can merge this as it is.